{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_header",
    "tags": [
     "sb_auto_header"
    ]
   },
   "source": [
    "<!-- This cell is automatically updated by tools/tutorial-cell-updater.py -->\n",
    "<!-- The contents are initialized from tutorials/notebook-header.md -->\n",
    "\n",
    "[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/tasks/speech-recognition-from-scratch.ipynb)\n",
    "to execute or view/download this notebook on\n",
    "[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/tasks/speech-recognition-from-scratch.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uo0JP7a5uFp7"
   },
   "source": [
    "# Speech Recognition From Scratch\n",
    "\n",
    "Ready to dive into the world of building your own speech recognizer using SpeechBrain?\n",
    "\n",
    "You're in luck because this tutorial is what you are looking for! We'll guide you through the whole process of setting up an offline **end-to-end attention-based speech recognizer**.\n",
    "\n",
    "But before we jump in, let's take a quick look at speech recognition and check out the cool techniques that SpeechBrain brings to the table.\n",
    "\n",
    "Let's get started! 🚀\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nUYxDoJKEk2J"
   },
   "source": [
    "## Overview of Speech Recognition\n",
    "In the figure, we show an example of a typical speech recognition pipeline used in SpeechBrain:\n",
    "\n",
    "\n",
    "![SpeechBrain-Page-2.png]()\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OyJ0gXjyG9sk"
   },
   "source": [
    "The speech recognition process begins with the **raw waveform directly 🎤**.\n",
    "\n",
    "The original waveform undergoes contamination through various **speech augmentation techniques**, such as *time/frequency dropout*, *speed change*, *adding noise*, *reverberation*, etc. These disturbances are activated randomly based on user-specified probabilities and are applied **on-the-fly** without the need to store augmented signals on disk.\n",
    "\n",
    "For a deeper understanding of the contamination techniques, check out our tutorials on [speech augmentation](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-augmentation.html) and [environmental corruption](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/environmental-corruption.html).\n",
    "\n",
    "Next, we extract **speech features**, such as *Short-Term Fourier Transform (STFT)*, *spectrograms*, *FBANKs*, and *MFCCs*. Thanks to a highly efficient GPU-friendly implementation, these features can be computed on the fly.\n",
    "\n",
    "For more detailed information, refer to our tutorials on [speech representation](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/fourier-transform-and-spectrograms.html) and [speech features](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-features.html).\n",
    "\n",
    "Subsequently, the features are fed into the **speech recognizer**, a neural network mapping input feature sequences to output token sequences (e.g., phonemes, characters, subwords, words). SpeechBrain supports popular techniques like Connectionist Temporal Classification (CTC), Transducers, or Encoder/Decoder with attention (using both RNN- and Transformer-based systems).\n",
    "\n",
    "Posterior probabilities over output tokens are processed by a beamsearcher that explores alternatives and outputs the best one. Optionally, alternatives can be rescored with an external language model, which may be based on RNN or transformers 🤖.\n",
    "\n",
    "Not all modules mentioned are mandatory; for example, data contamination can be skipped if not helpful for a specific task. Even beam search can be replaced with a greedy search for fast decoding.\n",
    "\n",
    "Now, let's delve into a more detailed discussion of the different technologies supported for speech recognition: 🚀\n",
    "\n",
    "![SpeechBrain-Page-3.png]()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yXy6DhQmhrYA"
   },
   "source": [
    "### Connectionist Temporal Classification (CTC)\n",
    "\n",
    "CTC stands out as the simplest speech recognition system within SpeechBrain.\n",
    "\n",
    "At each time step, it produces a prediction. CTC introduces a unique token, *blank*, enabling the network to output nothing when uncertain. The CTC cost function employs **dynamic programming** to align across all possible alignments.\n",
    "\n",
    "For each alignment, a corresponding probability can be computed. The ultimate CTC cost is the sum of the probabilities of all possible alignments, efficiently calculated using the forward algorithm (distinct from the one used in neural networks, as described in Hidden Markov Model literature).\n",
    "\n",
    "In encoder-decoder architectures, attention is used to learn the alignment between input-output sequences. In CTC, alignment isn't learned; instead, integration occurs over all possible alignments.\n",
    "\n",
    "Essentially, CTC implementation involves incorporating a specialized cost function atop the speech recognizer, often based on recurrent neural networks (RNNs), although not exclusively. 🧠\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VETVavnMvnar"
   },
   "source": [
    "### Transducers\n",
    "\n",
    "In the depicted figure, Transducers enhance CTC by introducing an autoregressive predictor and a join network.\n",
    "\n",
    "An encoder converts input features into a sequence of encoded representations. The predictor, on the other hand, generates a latent representation based on previously emitted outputs. A join network amalgamates these two, and a softmax classifier predicts the current output token. During training, CTC loss is applied after the classifier.\n",
    "\n",
    "For more in-depth insights into Transducers, check out this informative tutorial by Loren Lugosch: [Transducer Tutorial](https://lorenlugosch.github.io/posts/2020/11/transducer/) 📚."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gzNb1cZzvxUh"
   },
   "source": [
    "### Encoder-Decoder with Attention 👂\n",
    "\n",
    "Another widely-used approach in speech recognition involves employing an encoder-decoder architecture.\n",
    "\n",
    "- The **encoder** processes a sequence of speech features (or raw samples directly) to generate a sequence of states, denoted as h.\n",
    "- The **decoder** utilizes the last hidden state and produces N output tokens. Typically, the decoder is autoregressive, with the previous output fed back into the input. Decoding halts upon predicting the end-of-sentence (eos) token.\n",
    "- Encoders and decoders can be constructed using various neural architectures, such as RNNs, CNNs, Transformers, or combinations of them.\n",
    "\n",
    "The inclusion of **attention** facilitates dynamic connections between encoder and decoder states. SpeechBrain supports different attention types, including *content* or *location-aware* for RNN-based systems and *key-value*-based for Transformers. As a convergence enhancement, a CTC loss is often applied atop the encoder. 🚀\n",
    "\n",
    "This architecture provides flexibility and adaptability, allowing for effective speech recognition across diverse applications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bq7zSEHXexqC"
   },
   "source": [
    "### Beamsearch\n",
    "The beamsearcher employed in encoder-decoder models follows an autoregressive process. Here's how it operates:\n",
    "\n",
    "1. Initialization: The process begins with the <bos> (beginning-of-sequence) token.\n",
    "2. Prediction: The model predicts the N most promising next tokens based on the current input.\n",
    "3. Feeding Alternatives: These N alternatives are fed into the decoder to generate future hypotheses.\n",
    "4. Selection: The best N hypotheses are chosen based on certain criteria or scoring mechanisms.\n",
    "5. Iteration: The loop continues until the <eos> (end-of-sequence) token is predicted.\n",
    "\n",
    "\n",
    "![SpeechBrain-Page-2 (1).png]()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8h_cjNwOv4L1"
   },
   "source": [
    "We encourage the readers not familiar enough with speech recognition to gain more familiarity with this technology before moving on. Beyond scientific papers, online you can find amazing tutorials and blog posts, such as:\n",
    "- [An Intuitive Explanation of Connectionist Temporal Classification](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n",
    "- [Connectionist Temporal Classification](https://web.archive.org/web/20211017041333/https://machinelearning-blog.com/2018/09/05/753/)\n",
    "- [Sequence-to-sequence learning with Transducers](https://lorenlugosch.github.io/posts/2020/11/transducer/)\n",
    "- [Understanding Encoder-Decoder Sequence to Sequence Model](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346)\n",
    "- [What is a Transformer?](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)\n",
    "- [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)\n",
    "- [Attention and its Different Forms](https://towardsdatascience.com/attention-and-its-different-forms-7fc3674d14dc)\n",
    "- [How to Implement a Beam Search Decoder for Natural Language Processing](https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/)\n",
    "- [An intuitive explanation of Beam Search](https://towardsdatascience.com/an-intuitive-explanation-of-beam-search-9b1d744e7a0f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t3vzfilfMros"
   },
   "source": [
    "After this brief overview let's now see how we can develop a speech recognition system (encoder-decoder + CTC) with SpeechBrain.\n",
    "\n",
    "For simplicity, training will be done with a small open-source dataset called [mini-librispeech](https://www.openslr.org/31/), which only contains few hours of training data. In a real case, you need much more training material (e.g 100 or even 1000 hours) to reach acceptable performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7M6IoxLlEovh"
   },
   "source": [
    "## Installation\n",
    "\n",
    "To run the code fast enough, we suggest using a GPU (`Runtime => change runtime type => GPU`). In this tutorial, we will refer to the code in ```speechbrain/templates/ASR```.\n",
    "\n",
    "Before starting, let's install speechbrain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "beagAGw5t5bK"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Installing SpeechBrain via pip\n",
    "BRANCH = 'develop'\n",
    "!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH\n",
    "\n",
    "# Clone SpeechBrain repository\n",
    "!git clone https://github.com/speechbrain/speechbrain/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eWpl9xgAIXKE"
   },
   "source": [
    "## Which steps are needed?\n",
    "\n",
    "### 1. Prepare Your Data \n",
    "   - Create data manifest files (CSV or JSON format) specifying the location of speech data and corresponding text annotations.\n",
    "   - Utilize tools like [mini_librispeech_prepare.py](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/mini_librispeech_prepare.py) to generate these manifest files.\n",
    "\n",
    "### 2. Train a Tokenizer \n",
    "   - Decide on basic units for training the speech recognizer and language model (e.g., characters, phonemes, sub-words, words).\n",
    "   - Execute the tokenizer training script:\n",
    "     ```bash\n",
    "     cd speechbrain/templates/speech_recognition/Tokenizer\n",
    "     python train.py tokenizer.yaml\n",
    "     ```\n",
    "\n",
    "### 3. Train a Language Model \n",
    "   - Train a language model using a large text corpus (preferably within the same language domain as your target application).\n",
    "   - Example training script for a language model:\n",
    "     ```bash\n",
    "     pip install datasets\n",
    "     cd speechbrain/templates/speech_recognition/LM\n",
    "     python train.py RNNLM.yaml\n",
    "     ```\n",
    "\n",
    "### 4. Train the Speech Recognizer \n",
    "   - Train the speech recognizer using a chosen model (e.g., CRDNN) with an autoregressive GRU decoder and attention mechanism.\n",
    "   - Employ beamsearch along with the trained language model for sequence generation:\n",
    "     ```bash\n",
    "     cd speechbrain/templates/speech_recognition/ASR\n",
    "     python train.py train.yaml\n",
    "     ```\n",
    "\n",
    "### 5. Use the Speech Recognizer (Inference) \n",
    "   - After training, deploy the trained speech recognizer for inference.\n",
    "   - Leverage classes like EncoderDecoderASR in SpeechBrain to simplify the inference process.\n",
    "\n",
    "Each step is crucial for building an effective end-to-end speech recognizer.\n",
    "\n",
    "We will now provide a detailed description of all these steps.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rDgNu_b8k6qD"
   },
   "source": [
    "## Step 1: Prepare Your Data \n",
    "\n",
    "Data preparation is a critical initial step in training an end-to-end speech recognizer. Its primary objective is to generate data manifest files, which instruct SpeechBrain on the locations of audio data and their corresponding transcriptions. These manifest files, written in widely-used CSV and JSON formats, play a crucial role in organizing the training process.\n",
    "\n",
    "### Data Manifest Files\n",
    "\n",
    "Let's delve into the structure of a data manifest file in JSON format:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"1867-154075-0032\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac\",\n",
    "    \"length\": 16.09,\n",
    "    \"words\": \"AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE\"\n",
    "  },\n",
    "  \"1867-154075-0001\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac\",\n",
    "    \"length\": 14.9,\n",
    "    \"words\": \"THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION\"\n",
    "  },\n",
    "  \"1867-154075-0028\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0028.flac\",\n",
    "    \"length\": 16.41,\n",
    "    \"words\": \"MY NAME IS JOHN MARK I'M DOONE SOME CALL ME RONICKY DOONE I'M GLAD TO KNOW YOU RONICKY DOONE I IMAGINE THAT NAME FITS YOU NOW TELL ME THE STORY OF WHY YOU CAME TO THIS HOUSE OF COURSE IT WASN'T TO SEE A GIRL\"\n",
    "  },\n",
    "}\n",
    "```\n",
    "\n",
    "This structure follows a hierarchical format where the unique identifier of the spoken sentence serves as the first key. Key fields such as the path of the speech recording, its length in seconds, and the sequence of words uttered are specified for each entry.\n",
    "\n",
    "A special variable, `data_root`, allows dynamic changes to the data folder from the command line or the YAML hyperparameter file.\n",
    "\n",
    "### Preparation Script\n",
    "\n",
    "Creating a preparation script for your specific dataset is essential, considering that each dataset has its own format. For instance, the [mini_librispeech_prepare.py](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/mini_librispeech_prepare.py) script, tailored for the mini-librispeech dataset, serves as a foundational template. This script automatically downloads publicly available data, searches for audio files and transcriptions, and creates the JSON file.\n",
    "\n",
    "Use this script as a starting point for custom data preparation on your target dataset. It offers a practical guide for organizing training, validation, and test phases through three separate data manifest files.\n",
    "\n",
    "### Copy Your Data Locally\n",
    "\n",
    "In an HPC cluster or similar environments, optimizing code performance involves copying data to the local folder of the computing node. While not applicable in Google Colab, this practice significantly accelerates code execution by fetching data from the local filesystem instead of the shared one.\n",
    "\n",
    "Take note of these considerations as you embark on the crucial journey of data preparation for training your speech recognizer. 🚀🎙️\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9To_-2fej2SA"
   },
   "source": [
    "## Step 2: Tokenizer\n",
    "\n",
    "Choosing the basic tokens for your speech recognizer is a critical decision that impacts the model's performance. You have several options, each with its own set of advantages and challenges.\n",
    "\n",
    "### Using Characters as Tokens\n",
    "One straightforward approach is to predict characters, converting the sequence of words into a sequence of characters. For example:\n",
    "```\n",
    "THE CITY OF MONTREAL => ['T','H','E', '_', 'C','I','T','Y','_', 'O', 'F', '_, 'M','O','N','T','R','E','A','L']\n",
    "```\n",
    "Advantages and disadvantages of this approach include a small total number of tokens, the chance to generalize to unseen words, and the challenge of predicting long sequences.\n",
    "\n",
    "### Using Words as Tokens\n",
    "Predicting full words is another option:\n",
    "```\n",
    "THE CITY OF MONTREAL => ['THE','CITY','OF','MONTREAL']\n",
    "```\n",
    "Advantages include short output sequences, but the system can't generalize to new words, and tokens with little training material may be allocated.\n",
    "\n",
    "### Byte Pair Encoding (BPE) Tokens\n",
    "A middle ground is Byte Pair Encoding (BPE), a technique inherited from data compression. It allocates tokens for the most frequent sequences of characters:\n",
    "```\n",
    "THE CITY OF MONTREAL => ['THE', '▁CITY', '▁OF', '▁MO', 'NT', 'RE', 'AL']\n",
    "```\n",
    "BPE finds tokens based on the most frequent character pairs, allowing for flexibility in token length.\n",
    "\n",
    "#### How Many BPE Tokens?\n",
    "The number of tokens is a hyperparameter that depends on the available speech data. For reference, 1k to 10k tokens are reasonable for datasets like LibriSpeech (1000 hours of English sentences).\n",
    "\n",
    "### Train a Tokenizer\n",
    "SpeechBrain leverages [SentencePiece](https://github.com/google/sentencepiece) for tokenization. To find the tokens for your training transcriptions, run the following code:\n",
    "\n",
    "```bash\n",
    "cd speechbrain/templates/speech_recognition/Tokenizer\n",
    "python train.py tokenizer.yaml\n",
    "```\n",
    "\n",
    "This step is crucial in shaping the behavior of your speech recognizer. Experiment with different tokenization strategies to find the one that best suits your dataset and objectives. 🚀🔍\n",
    "\n",
    "Let's train the tokenizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tA4HMrnFJ33e"
   },
   "outputs": [],
   "source": [
    "%cd /content/speechbrain/templates/speech_recognition/Tokenizer\n",
    "!python train.py tokenizer.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PYko19NiKdtK"
   },
   "source": [
    "The code might take a while just because data are downloaded and prepared. As for all the other recipes in SpeechBrain, we have a training script (`train.py`) and a hyperparameter file (`tokenizer.yaml`). Let's take a closer look into the latter first:\n",
    "\n",
    "\n",
    "\n",
    "```yaml\n",
    "# ############################################################################\n",
    "# Tokenizer: subword BPE tokenizer with unigram 1K\n",
    "# Training: Mini-LibriSpeech\n",
    "# Authors:  Abdel Heba 2021\n",
    "#           Mirco Ravanelli 2021\n",
    "# ############################################################################\n",
    "\n",
    "\n",
    "# Set up folders for reading from and writing to\n",
    "data_folder: ../data\n",
    "output_folder: ./save\n",
    "\n",
    "# Path where data-specification files are stored\n",
    "train_annotation: ../train.json\n",
    "valid_annotation: ../valid.json\n",
    "test_annotation: ../test.json\n",
    "\n",
    "# Tokenizer parameters\n",
    "token_type: unigram  # [\"unigram\", \"bpe\", \"char\"]\n",
    "token_output: 1000  # index(blank/eos/bos/unk) = 0\n",
    "character_coverage: 1.0\n",
    "annotation_read: words # field to read\n",
    "\n",
    "# Tokenizer object\n",
    "tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece\n",
    "   model_dir: !ref <output_folder>\n",
    "   vocab_size: !ref <token_output>\n",
    "   annotation_train: !ref <train_annotation>\n",
    "   annotation_read: !ref <annotation_read>\n",
    "   model_type: !ref <token_type> # [\"unigram\", \"bpe\", \"char\"]\n",
    "   character_coverage: !ref <character_coverage>\n",
    "   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]\n",
    "   annotation_format: json\n",
    "```\n",
    "\n",
    "The tokenizer is trained on training annotations only. We set here a vocabulary size of 1000. Instead of using the standard BPE algorithm, we use a variation of it based on unigram smoothing. See [sentencepiece](https://github.com/google/sentencepiece) for more info.\n",
    "The tokenizer will be saved in the specified `output_folder`.\n",
    "\n",
    "Let's now take a look into the training script `train.py`:\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "if __name__ == \"__main__\":\n",
    "\n",
    "    # Load hyperparameters file with command-line overrides\n",
    "    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])\n",
    "    with open(hparams_file) as fin:\n",
    "        hparams = load_hyperpyyaml(fin, overrides)\n",
    "\n",
    "    # Create experiment directory\n",
    "    sb.create_experiment_directory(\n",
    "        experiment_directory=hparams[\"output_folder\"],\n",
    "        hyperparams_to_save=hparams_file,\n",
    "        overrides=overrides,\n",
    "    )\n",
    "\n",
    "    # Data preparation, to be run on only one process.\n",
    "    prepare_mini_librispeech(\n",
    "        data_folder=hparams[\"data_folder\"],\n",
    "        save_json_train=hparams[\"train_annotation\"],\n",
    "        save_json_valid=hparams[\"valid_annotation\"],\n",
    "        save_json_test=hparams[\"test_annotation\"],\n",
    "    )\n",
    "\n",
    "    # Train tokenizer\n",
    "    hparams[\"tokenizer\"]()\n",
    "```\n",
    "\n",
    "Essentially, we prepare the data with the `prepare_mini_librispeech` script and we then run the sentencepiece tokenizer wrapped in\n",
    "`speechbrain.tokenizers.SentencePiece.SentencePiece`.\n",
    "\n",
    "Let's take a look at the files generated by the tokenizer. If you go into the specified output folder (`Tokenizer/save`), you can find two files:\n",
    "+ *1000_unigram.model*\n",
    "+ *1000_unigram.vocab*\n",
    "\n",
    "The first is a binary file containing all the information needed for tokenizing an input text. The second is a text file reporting the list of tokens allocated (with their log probabilities):\n",
    "\n",
    "```\n",
    "▁THE  -3.2458\n",
    "S -3.36618\n",
    "ED  -3.84476\n",
    "▁ -3.91777\n",
    "E -3.92101\n",
    "▁AND  -3.92316\n",
    "▁A  -3.97359\n",
    "▁TO -4.00462\n",
    "▁OF -4.08116\n",
    "....\n",
    "```\n",
    "\n",
    "Let me now show how we can use the learned model to tokenize a text:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ik9hoxBUG03u"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import sentencepiece as spm\n",
    "sp = spm.SentencePieceProcessor()\n",
    "sp.load(\"/content/speechbrain/templates/speech_recognition/Tokenizer/save/1000_unigram.model\")\n",
    "\n",
    "# Encode as pieces\n",
    "print(sp.encode_as_pieces('THE CITY OF MONTREAL'))\n",
    "\n",
    "# Encode as ids\n",
    "print(sp.encode_as_ids('THE CITY OF MONTREAL'))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Oft-7K85LA86"
   },
   "source": [
    "Note that the sentencepiece tokenizers also assign a unique index to each allocated token. These indexes will correspond to the output of our neural networks for language models and ASR."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nkYENC7BJ4K9"
   },
   "source": [
    "## Step 3: Train a Language Model \n",
    "\n",
    "A Language Model (LM) plays a crucial role in enhancing the performance of a speech recognizer. In this tutorial, we adopt the concept of **shallow fusion**, incorporating language information within the beam searcher of the speech recognizer to rescore partial hypotheses. This involves scoring the partial hypotheses provided by the speech recognizer with language scores, penalizing sequences of tokens that are \"unlikely\" to be observed.\n",
    "\n",
    "### Text Corpus\n",
    "Training a language model typically involves using large text corpora, predicting the most probable next token. If you lack a substantial text corpus for your application, you may choose to skip this part. Additionally, training a language model on a large text corpus is computationally demanding, so consider leveraging pre-trained models and fine-tuning if needed.\n",
    "\n",
    "For the purposes of this tutorial, we train a language model on the training transcriptions of mini-librispeech. Keep in mind that this is a simplified demonstration for educational purposes.\n",
    "\n",
    "### Train a LM\n",
    "\n",
    "We are going to train a simple RNN-based language model that estimates the next tokens given the previous ones.\n",
    "\n",
    "![SpeechBrain-Page-3 (1).png]()\n",
    "\n",
    "To train it, run the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AtMw7x0ybFlI"
   },
   "outputs": [],
   "source": [
    "!pip install datasets\n",
    "%cd /content/speechbrain/templates/speech_recognition/LM\n",
    "!python train.py RNNLM.yaml #--device='cpu'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "b3tnXnrWc2My"
   },
   "source": [
    "As evident from the output, both training and validation losses exhibit a consistent decrease over time.\n",
    "\n",
    "Before delving into the code, let's explore the contents generated within the specified `output_folder`:\n",
    "\n",
    "*   `train_log.txt`: This file comprises statistics (e.g., train_loss, valid_loss) computed at each epoch.\n",
    "*   `log.txt`: A detailed logger providing timestamps for each fundamental operation.\n",
    "*   `env.log`: Displays all dependencies used along with their respective versions, facilitating replicability.\n",
    "\n",
    "*   `train.py`, `hyperparams.yaml`: Copies of the experiment file along with corresponding hyperparameters, crucial for ensuring replicability.\n",
    "\n",
    "*   `save`: The repository where the learned model is stored.\n",
    "\n",
    "Within the `save` folder, subfolders contain checkpoints saved during training, formatted as `CKPT+data+time`. Typically, two checkpoints reside here: the best (i.e., the oldest, representing optimal performance) and the latest (i.e., the most recent). If a single checkpoint is present, it indicates that the last epoch is also the best.\n",
    "\n",
    "Each checkpoint folder encompasses all information necessary for resuming training, including models, optimizers, schedulers, epoch counters, etc. The parameters of the RNNLM model are stored in the `model.ckpt` file, utilizing a binary format readable with `torch.load`.\n",
    "\n",
    "The hyperparameters section of the tutorial provides a comprehensive overview of the settings used for training the language model. Here's a refined version of the explanation:\n",
    "\n",
    "### Hyperparameters\n",
    "\n",
    "For a detailed look at the complete `RNNLM.yaml` file, please refer to [this link](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/LM/RNNLM.yaml).\n",
    "\n",
    "In the initial section, fundamental configurations such as the random seed, output folder paths, and training logger are defined:\n",
    "\n",
    "```yaml\n",
    "seed: 2602\n",
    "__set_seed: !apply:torch.manual_seed [!ref <seed>]\n",
    "output_folder: !ref results/RNNLM/\n",
    "save_folder: !ref <output_folder>/save\n",
    "train_log: !ref <output_folder>/train_log.txt\n",
    "```\n",
    "\n",
    "The subsequent segment outlines the paths for the text corpora used in training, validation, and testing:\n",
    "\n",
    "```yaml\n",
    "lm_train_data: data/train.txt\n",
    "lm_valid_data: data/valid.txt\n",
    "lm_test_data: data/test.txt\n",
    "```\n",
    "\n",
    "Unlike other recipes, the Language Model (LM) directly processes large raw text corpora without the need for JSON/CSV files, leveraging the [HuggingFace dataset](https://huggingface.co/) for efficiency.\n",
    "\n",
    "Following this, the setup for the train logger and the specification of the tokenizer (utilizing the one trained in the previous step) are detailed:\n",
    "\n",
    "```yaml\n",
    "train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger\n",
    "    save_file: !ref <train_log>\n",
    "\n",
    "tokenizer_file: ../Tokenizer/save/1000_unigram.model\n",
    "```\n",
    "\n",
    "Moving on, essential training hyperparameters, including epochs, batch size, and learning rate, are defined, along with critical architectural parameters such as embedding dimension, RNN size, layers, and output dimensionality:\n",
    "\n",
    "```yaml\n",
    "number_of_epochs: 20\n",
    "batch_size: 80\n",
    "lr: 0.001\n",
    "accu_steps: 1\n",
    "ckpt_interval_minutes: 15\n",
    "\n",
    "emb_dim: 256\n",
    "rnn_size: 512\n",
    "layers: 2\n",
    "output_neurons: 1000\n",
    "```\n",
    "\n",
    "Subsequently, the objects for training the language model are introduced, encompassing the RNN model, cost function, optimizer, and learning rate scheduler:\n",
    "\n",
    "```yaml\n",
    "model: !new:templates.speech_recognition.LM.custom_model.CustomModel\n",
    "    embedding_dim: !ref <emb_dim>\n",
    "    rnn_size: !ref <rnn_size>\n",
    "    layers: !ref <layers>\n",
    "\n",
    "compute_cost: !name:speechbrain.nnet.losses.nll_loss\n",
    "\n",
    "optimizer: !name:torch.optim.Adam\n",
    "    lr: !ref <lr>\n",
    "    betas: (0.9, 0.98)\n",
    "    eps: 0.000000001\n",
    "\n",
    "lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler\n",
    "    initial_value: !ref <lr>\n",
    "    improvement_threshold: 0.0025\n",
    "    annealing_factor: 0.8\n",
    "    patient: 0\n",
    "```\n",
    "\n",
    "The YAML file concludes with the specification of the epoch counter, tokenizer, and checkpointer:\n",
    "\n",
    "```yaml\n",
    "epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter\n",
    "    limit: !ref <number_of_epochs>\n",
    "\n",
    "modules:\n",
    "    model: !ref <model>\n",
    "\n",
    "tokenizer: !new:sentencepiece.SentencePieceProcessor\n",
    "\n",
    "checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer\n",
    "    checkpoints_dir: !ref <save_folder>\n",
    "    recoverables:\n",
    "        model: !ref <model>\n",
    "        scheduler: !ref <lr_annealing>\n",
    "        counter: !ref <epoch_counter>\n",
    "\n",
    "pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer\n",
    "    loadables:\n",
    "        tokenizer: !ref <tokenizer>\n",
    "    paths:\n",
    "        tokenizer: !ref <tokenizer_file>\n",
    "```\n",
    "\n",
    "The pre-trainer class facilitates the connection between the tokenizer object and the pre-trained tokenizer file.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mnCM5xuy85P4"
   },
   "source": [
    "### Experiment file\n",
    "Let's now take a look into how the objects, functions, and hyperparameters declared in the yaml file are used in `train.py` to implement the language model.\n",
    "\n",
    "Let's start from the main of the `train.py`:\n",
    "\n",
    "\n",
    "```python\n",
    "# Recipe begins!\n",
    "if __name__ == \"__main__\":\n",
    "\n",
    "    # Reading command line arguments\n",
    "    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])\n",
    "\n",
    "    # Initialize ddp (useful only for multi-GPU DDP training)\n",
    "    sb.utils.distributed.ddp_init_group(run_opts)\n",
    "\n",
    "    # Load hyperparameters file with command-line overrides\n",
    "    with open(hparams_file) as fin:\n",
    "        hparams = load_hyperpyyaml(fin, overrides)\n",
    "\n",
    "    # Create experiment directory\n",
    "    sb.create_experiment_directory(\n",
    "        experiment_directory=hparams[\"output_folder\"],\n",
    "        hyperparams_to_save=hparams_file,\n",
    "        overrides=overrides,\n",
    "    )\n",
    "```\n",
    "\n",
    "We here do some preliminary operations such as parsing the command line, initializing the distributed data-parallel (needed if multiple GPUs are used), creating the output folder, and reading the yaml file.\n",
    "\n",
    "After reading the yaml file with `load_hyperpyyaml`, all the objects declared in the hyperparameter files are initialized and available in a dictionary form (along with the other functions and parameters reported in the yaml file).\n",
    "For instance,  we will have `hparams['model']`, `hparams['optimizer']`, `hparams['batch_size']`, etc.\n",
    "\n",
    "\n",
    "#### Data-IO Pipeline\n",
    "We then call a special function that creates the dataset objects for training, validation, and test.\n",
    "\n",
    "```python\n",
    "    # Create dataset objects \"train\", \"valid\", and \"test\"\n",
    "    train_data, valid_data, test_data = dataio_prepare(hparams)\n",
    "```\n",
    "\n",
    "Let's take a closer look into that.\n",
    "\n",
    "\n",
    "```python\n",
    "def dataio_prepare(hparams):\n",
    "    \"\"\"This function prepares the datasets to be used in the brain class.\n",
    "    It also defines the data processing pipeline through user-defined functions.\n",
    "\n",
    "    The language model is trained with the text files specified by the user in\n",
    "    the hyperparameter file.\n",
    "\n",
    "    Arguments\n",
    "    ---------\n",
    "    hparams : dict\n",
    "        This dictionary is loaded from the `train.yaml` file, and it includes\n",
    "        all the hyperparameters needed for dataset construction and loading.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    datasets : list\n",
    "        List containing \"train\", \"valid\", and \"test\" sets that correspond\n",
    "        to the appropriate DynamicItemDataset object.\n",
    "    \"\"\"\n",
    "\n",
    "    logging.info(\"generating datasets...\")\n",
    "\n",
    "    # Prepare datasets\n",
    "    datasets = load_dataset(\n",
    "        \"text\",\n",
    "        data_files={\n",
    "            \"train\": hparams[\"lm_train_data\"],\n",
    "            \"valid\": hparams[\"lm_valid_data\"],\n",
    "            \"test\": hparams[\"lm_test_data\"],\n",
    "        },\n",
    "    )\n",
    "\n",
    "    # Convert huggingface's dataset to DynamicItemDataset via a magical function\n",
    "    train_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(\n",
    "        datasets[\"train\"]\n",
    "    )\n",
    "    valid_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(\n",
    "        datasets[\"valid\"]\n",
    "    )\n",
    "    test_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(\n",
    "        datasets[\"test\"]\n",
    "    )\n",
    "\n",
    "    datasets = [train_data, valid_data, test_data]\n",
    "    tokenizer = hparams[\"tokenizer\"]\n",
    "\n",
    "    # Define text processing pipeline. We start from the raw text and then\n",
    "    # encode it using the tokenizer. The tokens with bos are used for feeding\n",
    "    # the neural network, the tokens with eos for computing the cost function.\n",
    "    @sb.utils.data_pipeline.takes(\"text\")\n",
    "    @sb.utils.data_pipeline.provides(\"text\", \"tokens_bos\", \"tokens_eos\")\n",
    "    def text_pipeline(text):\n",
    "        yield text\n",
    "        tokens_list = tokenizer.encode_as_ids(text)\n",
    "        tokens_bos = torch.LongTensor([hparams[\"bos_index\"]] + (tokens_list))\n",
    "        yield tokens_bos\n",
    "        tokens_eos = torch.LongTensor(tokens_list + [hparams[\"eos_index\"]])\n",
    "        yield tokens_eos\n",
    "\n",
    "    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)\n",
    "\n",
    "    # 4. Set outputs to add into the batch. The batch variable will contain\n",
    "    # all these fields (e.g, batch.id, batch.text, batch.tokens.bos,..)\n",
    "    sb.dataio.dataset.set_output_keys(\n",
    "        datasets, [\"id\", \"text\", \"tokens_bos\", \"tokens_eos\"],\n",
    "    )\n",
    "    return train_data, valid_data, test_data\n",
    "```\n",
    "\n",
    "The first part is just a conversion from the HuggingFace dataset to the DynamicItemDataset used in SpeechBrain.\n",
    "\n",
    "You can notice that we expose the text processing function `text_pipeline`, which takes in input the text of one sentence and processes it in different ways.\n",
    "\n",
    "The text processing function converts the raw text into the corresponding tokens (in index form). We also create other variables such as the version of the sequence with the beginning of the sentence `<bos>`  token in front and the one with the end of sentence `<eos>` as the last element. Their usefulness will be clear later.\n",
    "\n",
    "Before returning the dataset objects, the `dataio_prepare` specifies which keys we would like to output. As we will see later, these keys will be available in the brain class as `batch.id`, `batch.text`, `batch.tokens_bos`, etc.\n",
    "[For more information on the data loader, please take a look into this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html)\n",
    "\n",
    "\n",
    "After the definition of the datasets, the main function can go ahead with the  initialization of the brain class:\n",
    "\n",
    "```python\n",
    "    # Initialize the Brain object to prepare for LM training.\n",
    "    lm_brain = LM(\n",
    "        modules=hparams[\"modules\"],\n",
    "        opt_class=hparams[\"optimizer\"],\n",
    "        hparams=hparams,\n",
    "        run_opts=run_opts,\n",
    "        checkpointer=hparams[\"checkpointer\"],\n",
    "    )\n",
    "```\n",
    "The brain class implements all the functionalities needed for supporting the training and validation loops.  Its `fit` and `evaluate` methods perform training and test, respectively:\n",
    "\n",
    "```python\n",
    "    lm_brain.fit(\n",
    "        lm_brain.hparams.epoch_counter,\n",
    "        train_data,\n",
    "        valid_data,\n",
    "        train_loader_kwargs=hparams[\"train_dataloader_opts\"],\n",
    "        valid_loader_kwargs=hparams[\"valid_dataloader_opts\"],\n",
    "    )\n",
    "\n",
    "    # Load best checkpoint for evaluation\n",
    "    test_stats = lm_brain.evaluate(\n",
    "        test_data,\n",
    "        min_key=\"loss\",\n",
    "        test_loader_kwargs=hparams[\"test_dataloader_opts\"],\n",
    "    )\n",
    "```\n",
    "The training and validation data loaders are given in input to the fit method, while the test dataset is fed into the evaluate method.\n",
    "\n",
    "Let's now take a look into the most important methods defined in the brain class.\n",
    "\n",
    "#### Forward Computations\n",
    "\n",
    "Let's start with the `forward` function, which defines all the computations needed to transform the input text into the output predictions.\n",
    "\n",
    "\n",
    "```python\n",
    "    def compute_forward(self, batch, stage):\n",
    "        \"\"\"Predicts the next word given the previous ones.\n",
    "\n",
    "        Arguments\n",
    "        ---------\n",
    "        batch : PaddedBatch\n",
    "            This batch object contains all the relevant tensors for computation.\n",
    "        stage : sb.Stage\n",
    "            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        predictions : torch.Tensor\n",
    "            A tensor containing the posterior probabilities (predictions).\n",
    "        \"\"\"\n",
    "        batch = batch.to(self.device)\n",
    "        tokens_bos, _ = batch.tokens_bos\n",
    "        pred = self.hparams.model(tokens_bos)\n",
    "        return pred\n",
    "```\n",
    "\n",
    "In this case, the chain of computation is very simple. We just put the batch on the right device and feed the encoded tokens into the model. We feed the tokens with `<bos>` into the model.\n",
    "When adding the `<bos>` token, in fact, we shift all the tokens by one element. This way, our input corresponds to the previous token while our model tries to predict the current one.\n",
    "\n",
    "#### Compute Objectives\n",
    "\n",
    "Let's take a look now into the `compute_objectives` method that takes in input the targets, the predictions, and estimates a loss function:\n",
    "\n",
    "```python\n",
    "    def compute_objectives(self, predictions, batch, stage):\n",
    "        \"\"\"Computes the loss given the predicted and targeted outputs.\n",
    "\n",
    "        Arguments\n",
    "        ---------\n",
    "        predictions : torch.Tensor\n",
    "            The posterior probabilities from `compute_forward`.\n",
    "        batch : PaddedBatch\n",
    "            This batch object contains all the relevant tensors for computation.\n",
    "        stage : sb.Stage\n",
    "            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        loss : torch.Tensor\n",
    "            A one-element tensor used for backpropagating the gradient.\n",
    "        \"\"\"\n",
    "        batch = batch.to(self.device)\n",
    "        tokens_eos, tokens_len = batch.tokens_eos\n",
    "        loss = self.hparams.compute_cost(\n",
    "            predictions, tokens_eos, length=tokens_len\n",
    "        )\n",
    "        return loss\n",
    "```\n",
    "The predictions are those computed in the forward method. The cost function is evaluated by comparing these predictions with the target tokens. We here use the tokens with the special `<eos>` token at the end because we want to predict when the sentence ends as well.\n",
    "\n",
    "####**Other methods**\n",
    "Beyond these two important functions, we have some other methods that are used by the brain class. In particular, the `fit_batch` trains each batch of data (by computing the gradient with the backward method and the updates with step one). The `on_stage_end`, is called at the end of each stage (e.g, at the end of each training epoch) and mainly takes care of statistic management, learning rate annealing, and checkpointing. [For a more detailed description of the brain class, please take a look into this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/brain-class.html). For more information on checkpointing, [take a look here](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/checkpointing.html)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tFJ34alleSBH"
   },
   "source": [
    "### Step 4: Training the Attention-Based End-to-End Speech Recognizer\n",
    "\n",
    "Now it's time to train our attention-based end-to-end speech recognizer. This offline recognizer employs a sophisticated architecture, utilizing a combination of convolutional, recurrent, and fully connected models in the encoder, and an autoregressive GRU decoder.\n",
    "\n",
    "The crucial link between the encoder and decoder is an attention mechanism. To enhance performance, the final sequence of words is obtained through beam search, coupled with the previously trained RNNLM.\n",
    "\n",
    "#### Architecture Overview:\n",
    "- **Encoder:** Combines convolutional, recurrent, and fully connected models.\n",
    "- **Decoder:** Autoregressive GRU decoder.\n",
    "- **Attention Mechanism:** Enhances information flow between the encoder and decoder.\n",
    "- **CTC (Connectionist Temporal Classification):** Jointly trained with the attention-based system, applied on top of the encoder.\n",
    "- **Data Augmentation:** Employed techniques to augment data and improve overall system performance.\n",
    "\n",
    "\n",
    "### Train the speech recognizer\n",
    "To train the speech recognizer, run the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "55c4jnVCeoGa"
   },
   "outputs": [],
   "source": [
    "%cd /content/speechbrain/templates/speech_recognition/ASR\n",
    "!python train.py train.yaml --number_of_epochs=1  --batch_size=2  --enable_add_reverb=False --enable_add_noise=False #To speed up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z-JCc1qyXpFI"
   },
   "source": [
    "Executing this code may require a considerable amount of time on Google Colab. Monitoring the log, you'll observe a progressive improvement in loss after each epoch.\n",
    "\n",
    "Similar to the RNNLM section, the specified `output_folder` will include the previously discussed files and folders. Additionally, a file named `wer.txt` is saved, providing a comprehensive report on the Word Error Rate (WER) achieved for each test sentence. This file not only captures the WER values but also includes the alignment information with the true transcription for enhanced analysis:\n",
    "\n",
    "\n",
    "```\n",
    "%WER 3.09 [ 1622 / 52576, 167 ins, 171 del, 1284 sub ]\n",
    "%SER 33.66 [ 882 / 2620 ]\n",
    "Scored 2620 sentences, 0 not present in hyp.\n",
    "================================================================================\n",
    "ALIGNMENTS\n",
    "\n",
    "Format:\n",
    "<utterance-id>, WER DETAILS\n",
    "<eps> ; reference  ; on ; the ; first ;  line\n",
    "  I   ;     S      ; =  ;  =  ;   S   ;   D  \n",
    " and  ; hypothesis ; on ; the ; third ; <eps>\n",
    "================================================================================\n",
    "672-122797-0033, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]\n",
    "A ; STORY\n",
    "= ;   =  \n",
    "A ; STORY\n",
    "================================================================================\n",
    "2094-142345-0041, %WER 0.00 [ 0 / 1, 0 ins, 0 del, 0 sub ]\n",
    "DIRECTION\n",
    "    =    \n",
    "DIRECTION\n",
    "================================================================================\n",
    "2830-3980-0026, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]\n",
    "VERSE ; TWO\n",
    "  S   ;  =\n",
    "FIRST ; TWO\n",
    "================================================================================\n",
    "237-134500-0025, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]\n",
    "OH ;  EMIL\n",
    "=  ;   S  \n",
    "OH ; AMIEL\n",
    "================================================================================\n",
    "7127-75947-0012, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]\n",
    "INDEED ; AH\n",
    "  =    ; =\n",
    "INDEED ; AH\n",
    "================================================================================\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "Let's now take a closer look into the hyperparameter (`train.yaml`)  and experiment script (`train.py`).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dfHa8TQMYUle"
   },
   "source": [
    "### Hyperparameters\n",
    "\n",
    "The hyperparameter file starts with the definition of basic things, such as seed and path settings:\n",
    "\n",
    "```yaml\n",
    "# Seed needs to be set at top of yaml, before objects with parameters are instantiated\n",
    "seed: 2602\n",
    "__set_seed: !apply:torch.manual_seed [!ref <seed>]\n",
    "\n",
    "# If you plan to train a system on an HPC cluster with a big dataset,\n",
    "# we strongly suggest doing the following:\n",
    "# 1- Compress the dataset in a single tar or zip file.\n",
    "# 2- Copy your dataset locally (i.e., the local disk of the computing node).\n",
    "# 3- Uncompress the dataset in the local folder.\n",
    "# 4- Set data_folder with the local path\n",
    "# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.\n",
    "# It allows you to read the data much faster without slowing down the shared filesystem.\n",
    "\n",
    "data_folder: ../data # In this case, data will be automatically downloaded here.\n",
    "data_folder_noise: !ref <data_folder>/noise # The noisy sequencies for data augmentation will automatically be downloaded here.\n",
    "data_folder_rir: !ref <data_folder>/rir # The impulse responses used for data augmentation will automatically be downloaded here.\n",
    "\n",
    "# Data for augmentation\n",
    "NOISE_DATASET_URL: https://www.dropbox.com/scl/fi/a09pj97s5ifan81dqhi4n/noises.zip?rlkey=j8b0n9kdjdr32o1f06t0cw5b7&dl=1\n",
    "RIR_DATASET_URL: https://www.dropbox.com/scl/fi/linhy77c36mu10965a836/RIRs.zip?rlkey=pg9cu8vrpn2u173vhiqyu743u&dl=1\n",
    "\n",
    "output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>\n",
    "test_wer_file: !ref <output_folder>/wer_test.txt\n",
    "save_folder: !ref <output_folder>/save\n",
    "train_log: !ref <output_folder>/train_log.txt\n",
    "\n",
    "# Language model (LM) pretraining\n",
    "# NB: To avoid mismatch, the speech recognizer must be trained with the same\n",
    "# tokenizer used for LM training. Here, we download everything from the\n",
    "# speechbrain HuggingFace repository. However, a local path pointing to a\n",
    "# directory containing the lm.ckpt and tokenizer.ckpt may also be specified\n",
    "# instead. E.g if you want to use your own LM / tokenizer.\n",
    "pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech\n",
    "\n",
    "\n",
    "# Path where data manifest files will be stored. The data manifest files are created by the\n",
    "# data preparation script\n",
    "train_annotation: ../train.json\n",
    "valid_annotation: ../valid.json\n",
    "test_annotation: ../test.json\n",
    "noise_annotation: ../noise.csv\n",
    "rir_annotation: ../rir.csv\n",
    "\n",
    "skip_prep: False\n",
    "\n",
    "# The train logger writes training statistics to a file, as well as stdout.\n",
    "train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger\n",
    "    save_file: !ref <train_log>\n",
    "```\n",
    "\n",
    "The `data_folder` corresponds to the path where the mini-librispeech is stored. If not available, the mini-librispeech dataset will be downloaded here. As mentioned, the script also supports data augmentation. To do it, we use the impulse responses and noise sequences of the open rir dataset (again, if not available it will be downloaded here).\n",
    "\n",
    "We also specify the folder where the language model is saved. In this case, we use the official pre-trained language model available on HuggingFace, but you can change it and use the one trained at the previous step (you should point to the checkpoint in the folder where the best `model.cpkt` is stored).\n",
    "What is important is that the set of tokens used for the LM and the one used for training the speech recognizer match exactly.\n",
    "\n",
    "We also have to specify the data manifest files for training, validation, and test. If not available, these files will be created by the data preparation script called in `train.py`.\n",
    "\n",
    "After that, we define a bunch of parameters for training, feature extraction, model definition, and decoding:\n",
    "\n",
    "```yaml\n",
    "# Training parameters\n",
    "number_of_epochs: 15\n",
    "number_of_ctc_epochs: 5\n",
    "batch_size: 8\n",
    "lr: 1.0\n",
    "ctc_weight: 0.5\n",
    "sorting: ascending\n",
    "ckpt_interval_minutes: 15 # save checkpoint every N min\n",
    "label_smoothing: 0.1\n",
    "\n",
    "# Dataloader options\n",
    "train_dataloader_opts:\n",
    "    batch_size: !ref <batch_size>\n",
    "\n",
    "valid_dataloader_opts:\n",
    "    batch_size: !ref <batch_size>\n",
    "\n",
    "test_dataloader_opts:\n",
    "    batch_size: !ref <batch_size>\n",
    "\n",
    "\n",
    "# Feature parameters\n",
    "sample_rate: 16000\n",
    "n_fft: 400\n",
    "n_mels: 40\n",
    "\n",
    "# Model parameters\n",
    "activation: !name:torch.nn.LeakyReLU\n",
    "dropout: 0.15\n",
    "cnn_blocks: 2\n",
    "cnn_channels: (128, 256)\n",
    "inter_layer_pooling_size: (2, 2)\n",
    "cnn_kernelsize: (3, 3)\n",
    "time_pooling_size: 4\n",
    "rnn_class: !name:speechbrain.nnet.RNN.LSTM\n",
    "rnn_layers: 4\n",
    "rnn_neurons: 1024\n",
    "rnn_bidirectional: True\n",
    "dnn_blocks: 2\n",
    "dnn_neurons: 512\n",
    "emb_size: 128\n",
    "dec_neurons: 1024\n",
    "output_neurons: 1000  # Number of tokens (same as LM)\n",
    "blank_index: 0\n",
    "bos_index: 0\n",
    "eos_index: 0\n",
    "unk_index: 0\n",
    "\n",
    "# Decoding parameters\n",
    "min_decode_ratio: 0.0\n",
    "max_decode_ratio: 1.0\n",
    "valid_beam_size: 8\n",
    "test_beam_size: 80\n",
    "eos_threshold: 1.5\n",
    "using_max_attn_shift: True\n",
    "max_attn_shift: 240\n",
    "lm_weight: 0.50\n",
    "ctc_weight_decode: 0.0\n",
    "coverage_penalty: 1.5\n",
    "temperature: 1.25\n",
    "temperature_lm: 1.25\n",
    "```\n",
    "\n",
    "For instance, we define the number of epochs, the initial learning rate, the batch size, the weight of the CTC loss, and many others.\n",
    "\n",
    "By setting sorting to `ascending`, we sort all the sentences in ascending order before creating the batches. This minimizes the need for zero paddings and thus makes training faster without losing performance (at least in this task with this model).\n",
    "\n",
    "Many other parameters, such as those for data augmentations, are defined. For the exact meaning of all of them, you can refer to the docstring of the function/class using this hyperparameter.\n",
    "\n",
    "In the next block, we define the most important classes that are needed to implement the speech recognizer:\n",
    "\n",
    "\n",
    "```yaml\n",
    "# The first object passed to the Brain class is this \"Epoch Counter\"\n",
    "# which is saved by the Checkpointer so that training can be resumed\n",
    "# if it gets interrupted at any point.\n",
    "epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter\n",
    "    limit: !ref <number_of_epochs>\n",
    "\n",
    "# Feature extraction\n",
    "compute_features: !new:speechbrain.lobes.features.Fbank\n",
    "    sample_rate: !ref <sample_rate>\n",
    "    n_fft: !ref <n_fft>\n",
    "    n_mels: !ref <n_mels>\n",
    "\n",
    "# Feature normalization (mean and std)\n",
    "normalize: !new:speechbrain.processing.features.InputNormalization\n",
    "    norm_type: global\n",
    "\n",
    "# Added noise and reverb come from OpenRIR dataset, automatically\n",
    "# downloaded and prepared with this Environmental Corruption class.\n",
    "env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt\n",
    "    openrir_folder: !ref <data_folder_rirs>\n",
    "    babble_prob: 0.0\n",
    "    reverb_prob: 0.0\n",
    "    noise_prob: 1.0\n",
    "    noise_snr_low: 0\n",
    "    noise_snr_high: 15\n",
    "\n",
    "# Adds speech change + time and frequnecy dropouts (time-domain implementation).\n",
    "augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment\n",
    "    sample_rate: !ref <sample_rate>\n",
    "    speeds: [95, 100, 105]\n",
    "\n",
    "# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.\n",
    "encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN\n",
    "    input_shape: [null, null, !ref <n_mels>]\n",
    "    activation: !ref <activation>\n",
    "    dropout: !ref <dropout>\n",
    "    cnn_blocks: !ref <cnn_blocks>\n",
    "    cnn_channels: !ref <cnn_channels>\n",
    "    cnn_kernelsize: !ref <cnn_kernelsize>\n",
    "    inter_layer_pooling_size: !ref <inter_layer_pooling_size>\n",
    "    time_pooling: True\n",
    "    using_2d_pooling: False\n",
    "    time_pooling_size: !ref <time_pooling_size>\n",
    "    rnn_class: !ref <rnn_class>\n",
    "    rnn_layers: !ref <rnn_layers>\n",
    "    rnn_neurons: !ref <rnn_neurons>\n",
    "    rnn_bidirectional: !ref <rnn_bidirectional>\n",
    "    rnn_re_init: True\n",
    "    dnn_blocks: !ref <dnn_blocks>\n",
    "    dnn_neurons: !ref <dnn_neurons>\n",
    "    use_rnnp: False\n",
    "\n",
    "# Embedding (from indexes to an embedding space of dimension emb_size).\n",
    "embedding: !new:speechbrain.nnet.embedding.Embedding\n",
    "    num_embeddings: !ref <output_neurons>\n",
    "    embedding_dim: !ref <emb_size>\n",
    "\n",
    "# Attention-based RNN decoder.\n",
    "decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder\n",
    "    enc_dim: !ref <dnn_neurons>\n",
    "    input_size: !ref <emb_size>\n",
    "    rnn_type: gru\n",
    "    attn_type: location\n",
    "    hidden_size: !ref <dec_neurons>\n",
    "    attn_dim: 1024\n",
    "    num_layers: 1\n",
    "    scaling: 1.0\n",
    "    channels: 10\n",
    "    kernel_size: 100\n",
    "    re_init: True\n",
    "    dropout: !ref <dropout>\n",
    "\n",
    "# Linear transformation on the top of the encoder.\n",
    "ctc_lin: !new:speechbrain.nnet.linear.Linear\n",
    "    input_size: !ref <dnn_neurons>\n",
    "    n_neurons: !ref <output_neurons>\n",
    "\n",
    "# Linear transformation on the top of the decoder.\n",
    "seq_lin: !new:speechbrain.nnet.linear.Linear\n",
    "    input_size: !ref <dec_neurons>\n",
    "    n_neurons: !ref <output_neurons>\n",
    "\n",
    "# Final softmax (for log posteriors computation).\n",
    "log_softmax: !new:speechbrain.nnet.activations.Softmax\n",
    "    apply_log: True\n",
    "\n",
    "# Cost definition for the CTC part.\n",
    "ctc_cost: !name:speechbrain.nnet.losses.ctc_loss\n",
    "    blank_index: !ref <blank_index>\n",
    "\n",
    "# Tokenizer initialization\n",
    "tokenizer: !new:sentencepiece.SentencePieceProcessor\n",
    "\n",
    "# Objects in \"modules\" dict will have their parameters moved to the correct\n",
    "# device, as well as having train()/eval() called on them by the Brain class\n",
    "modules:\n",
    "    encoder: !ref <encoder>\n",
    "    embedding: !ref <embedding>\n",
    "    decoder: !ref <decoder>\n",
    "    ctc_lin: !ref <ctc_lin>\n",
    "    seq_lin: !ref <seq_lin>\n",
    "    normalize: !ref <normalize>\n",
    "    env_corrupt: !ref <env_corrupt>\n",
    "    lm_model: !ref <lm_model>\n",
    "\n",
    "# Gathering all the submodels in a single model object.\n",
    "model: !new:torch.nn.ModuleList\n",
    "    - - !ref <encoder>\n",
    "      - !ref <embedding>\n",
    "      - !ref <decoder>\n",
    "      - !ref <ctc_lin>\n",
    "      - !ref <seq_lin>\n",
    "\n",
    "# This is the RNNLM that is used according to the Huggingface repository\n",
    "# NB: It has to match the pre-trained RNNLM!!\n",
    "lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM\n",
    "    output_neurons: !ref <output_neurons>\n",
    "    embedding_dim: !ref <emb_size>\n",
    "    activation: !name:torch.nn.LeakyReLU\n",
    "    dropout: 0.0\n",
    "    rnn_layers: 2\n",
    "    rnn_neurons: 2048\n",
    "    dnn_blocks: 1\n",
    "    dnn_neurons: 512\n",
    "    return_hidden: True  # For inference\n",
    "```\n",
    "\n",
    "For instance, we define the function for computing features and normalizing them. We define the class for environmental corruption and data augmentation ([please, see this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-augmentation.html)), the architecture of the encoder, decoder, and the other models need by the speech recognizer.\n",
    "\n",
    "\n",
    "We then report the parameters for beasearch:\n",
    "\n",
    "```yaml\n",
    "# Define scorers for beam search\n",
    "\n",
    "# If ctc_scorer is set, the decoder uses CTC + attention beamsearch. This\n",
    "# improves the performance, but slows down decoding.\n",
    "ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer\n",
    "    eos_index: !ref <eos_index>\n",
    "    blank_index: !ref <blank_index>\n",
    "    ctc_fc: !ref <ctc_lin>\n",
    "\n",
    "# If coverage_scorer is set, coverage penalty is applied based on accumulated\n",
    "# attention weights during beamsearch.\n",
    "coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer\n",
    "    vocab_size: !ref <output_neurons>\n",
    "\n",
    "# If the lm_scorer is set, a language model\n",
    "# is applied (with a weight specified in scorer).\n",
    "rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer\n",
    "    language_model: !ref <lm_model>\n",
    "    temperature: !ref <temperature_lm>\n",
    "\n",
    "# Gathering all scorers in a scorer instance for beamsearch:\n",
    "# - full_scorers are scorers which score on full vocab set, while partial_scorers\n",
    "# are scorers which score on pruned tokens.\n",
    "# - The number of pruned tokens is decided by scorer_beam_scale * beam_size.\n",
    "# - For some scorers like ctc_scorer, ngramlm_scorer, putting them\n",
    "# into full_scorers list would be too heavy. partial_scorers are more\n",
    "# efficient because they score on pruned tokens at little cost of\n",
    "# performance drop. For other scorers, please see the speechbrain.decoders.scorer.\n",
    "test_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder\n",
    "    scorer_beam_scale: 1.5\n",
    "    full_scorers: [\n",
    "        !ref <rnnlm_scorer>,\n",
    "        !ref <coverage_scorer>]\n",
    "    partial_scorers: [!ref <ctc_scorer>]\n",
    "    weights:\n",
    "        rnnlm: !ref <lm_weight>\n",
    "        coverage: !ref <coverage_penalty>\n",
    "        ctc: !ref <ctc_weight_decode>\n",
    "\n",
    "valid_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder\n",
    "    full_scorers: [!ref <coverage_scorer>]\n",
    "    weights:\n",
    "        coverage: !ref <coverage_penalty>\n",
    "\n",
    "# Beamsearch is applied on the top of the decoder. For a description of\n",
    "# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearcher.\n",
    "\n",
    "# It makes sense to have a lighter search during validation. In this case,\n",
    "# we don't use scorers during decoding.\n",
    "valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher\n",
    "    embedding: !ref <embedding>\n",
    "    decoder: !ref <decoder>\n",
    "    linear: !ref <seq_lin>\n",
    "    bos_index: !ref <bos_index>\n",
    "    eos_index: !ref <eos_index>\n",
    "    min_decode_ratio: !ref <min_decode_ratio>\n",
    "    max_decode_ratio: !ref <max_decode_ratio>\n",
    "    beam_size: !ref <valid_beam_size>\n",
    "    eos_threshold: !ref <eos_threshold>\n",
    "    using_max_attn_shift: !ref <using_max_attn_shift>\n",
    "    max_attn_shift: !ref <max_attn_shift>\n",
    "    temperature: !ref <temperature>\n",
    "    scorer: !ref <valid_scorer>\n",
    "\n",
    "# The final decoding on the test set can be more computationally demanding.\n",
    "# In this case, we use the LM + CTC probabilities during decoding as well,\n",
    "# which are defined in scorer.\n",
    "# Please remove the scorer if you need a faster decoder.\n",
    "test_search: !new:speechbrain.decoders.S2SRNNBeamSearcher\n",
    "    embedding: !ref <embedding>\n",
    "    decoder: !ref <decoder>\n",
    "    linear: !ref <seq_lin>\n",
    "    bos_index: !ref <bos_index>\n",
    "    eos_index: !ref <eos_index>\n",
    "    min_decode_ratio: !ref <min_decode_ratio>\n",
    "    max_decode_ratio: !ref <max_decode_ratio>\n",
    "    beam_size: !ref <test_beam_size>\n",
    "    eos_threshold: !ref <eos_threshold>\n",
    "    using_max_attn_shift: !ref <using_max_attn_shift>\n",
    "    max_attn_shift: !ref <max_attn_shift>\n",
    "    temperature: !ref <temperature>\n",
    "    scorer: !ref <test_scorer>\n",
    "```\n",
    "We here employ different hyperparameters for the validation and test beamsearch. In particular, a smaller beam size is used during validation. The reason is that validation is done at the end of each epoch and should thus be fast. Testing, instead, is done only once at the end of training, so we can afford a more accurate (and slower) search.\n",
    "\n",
    "\n",
    "Finally, we declare the last objects needed by the training recipe, such as the learning rate scheduler (`lr_annealing`), the optimizer, the checkpointer, etc.:\n",
    "\n",
    "\n",
    "```yaml\n",
    "# This function manages learning rate annealing over the epochs.\n",
    "# We here use the NewBoB algorithm, which anneals the learning rate if\n",
    "# the improvement over two consecutive epochs is less than the defined\n",
    "# threshold.\n",
    "lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler\n",
    "    initial_value: !ref <lr>\n",
    "    improvement_threshold: 0.0025\n",
    "    annealing_factor: 0.8\n",
    "    patient: 0\n",
    "\n",
    "# This optimizer will be constructed by the Brain class after all parameters\n",
    "# are moved to the correct device. Then it will be added to the checkpointer.\n",
    "opt_class: !name:torch.optim.Adadelta\n",
    "    lr: !ref <lr>\n",
    "    rho: 0.95\n",
    "    eps: 1.e-8\n",
    "\n",
    "# Functions that compute the statistics to track during the validation step.\n",
    "error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats\n",
    "\n",
    "cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats\n",
    "    split_tokens: True\n",
    "\n",
    "# This object is used for saving the state of training both so that it\n",
    "# can be resumed if it gets interrupted, and also so that the best checkpoint\n",
    "# can be later loaded for evaluation or inference.\n",
    "checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer\n",
    "    checkpoints_dir: !ref <save_folder>\n",
    "    recoverables:\n",
    "        model: !ref <model>\n",
    "        scheduler: !ref <lr_annealing>\n",
    "        normalizer: !ref <normalize>\n",
    "        counter: !ref <epoch_counter>\n",
    "\n",
    "# This object is used to pretrain the language model and the tokenizer\n",
    "# (defined above). In this case, we also pretrain the ASR model (to make\n",
    "# sure the model converges on this small amount of data).\n",
    "pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer\n",
    "    collect_in: !ref <save_folder>\n",
    "    loadables:\n",
    "        lm: !ref <lm_model>\n",
    "        tokenizer: !ref <tokenizer>\n",
    "        model: !ref <model>\n",
    "    paths:\n",
    "        lm: !ref <pretrained_path>/lm.ckpt\n",
    "        tokenizer: !ref <pretrained_path>/tokenizer.ckpt\n",
    "        model: !ref <pretrained_path>/asr.ckpt\n",
    "```\n",
    "\n",
    "The final object is the pretrainer, which links the language model, the tokenizer, and the acoustic speech recognition model with the corresponding files used for pre-training. We here pre-train the acoustic model as well: on such a small dataset, it is very hard to make an end-to-end speech recognizer converge, so we use another model to pre-train it (you should skip this part when training on a larger dataset)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6xcAJ4OlYZCh"
   },
   "source": [
    "### Experiment file\n",
    "Let's now see how the different elements declared in the YAML file are connected in `train.py`.\n",
    "The training script closely follows the one already described for the language model.\n",
    "\n",
    "The `main` function starts with the implementation of basic functionalities such as parsing the command line, initializing the distributed data-parallel (needed for multiple GPU training), and reading the yaml file.\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "\n",
    "    # Reading command line arguments\n",
    "    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])\n",
    "\n",
    "    # Initialize ddp (useful only for multi-GPU DDP training)\n",
    "    sb.utils.distributed.ddp_init_group(run_opts)\n",
    "\n",
    "    # Load hyperparameters file with command-line overrides\n",
    "    with open(hparams_file) as fin:\n",
    "        hparams = load_hyperpyyaml(fin, overrides)\n",
    "\n",
    "    # Create experiment directory\n",
    "    sb.create_experiment_directory(\n",
    "        experiment_directory=hparams[\"output_folder\"],\n",
    "        hyperparams_to_save=hparams_file,\n",
    "        overrides=overrides,\n",
    "    )\n",
    "\n",
    "    # Data preparation, to be run on only one process.\n",
    "    if not hparams[\"skip_prep\"]:\n",
    "        sb.utils.distributed.run_on_main(\n",
    "            prepare_mini_librispeech,\n",
    "            kwargs={\n",
    "                \"data_folder\": hparams[\"data_folder\"],\n",
    "                \"save_json_train\": hparams[\"train_annotation\"],\n",
    "                \"save_json_valid\": hparams[\"valid_annotation\"],\n",
    "                \"save_json_test\": hparams[\"test_annotation\"],\n",
    "            },\n",
    "        )\n",
    "    sb.utils.distributed.run_on_main(hparams[\"prepare_noise_data\"])\n",
    "    sb.utils.distributed.run_on_main(hparams[\"prepare_rir_data\"])\n",
    "\n",
    "```\n",
    "The YAML file is read with the `load_hyperpyyaml` function. After reading it, all the declared objects are initialized and available within the `hparams` dictionary, along with the other functions and variables (e.g., `hparams['model']`, `hparams['test_search']`, `hparams['batch_size']`).\n",
    "\n",
    "After that, we run the data preparation, whose goal is to create the data manifest files (if not already available). This operation requires writing some files to disk. For this reason, we use `sb.utils.distributed.run_on_main` to make sure it is executed by the main process only. This avoids possible conflicts when using multiple GPUs with DDP. For more info on multi-GPU training in SpeechBrain, [please see this tutorial](https://speechbrain.readthedocs.io/en/latest/multigpu.html).\n",
    "\n",
    "#### Data-IO Pipeline\n",
    "At this point, we can create the dataset object that we will use for training, validation, and test loops:\n",
    "\n",
    "```python\n",
    "    # We can now directly create the datasets for training, valid, and test\n",
    "    datasets = dataio_prepare(hparams)\n",
    "```\n",
    "\n",
    "This function allows users to fully customize the data reading pipeline. Let's take a closer look into it:\n",
    "\n",
    "```python\n",
    "def dataio_prepare(hparams):\n",
    "    \"\"\"This function prepares the datasets to be used in the brain class.\n",
    "    It also defines the data processing pipeline through user-defined functions.\n",
    "\n",
    "\n",
    "    Arguments\n",
    "    ---------\n",
    "    hparams : dict\n",
    "        This dictionary is loaded from the `train.yaml` file, and it includes\n",
    "        all the hyperparameters needed for dataset construction and loading.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    datasets : dict\n",
    "        Dictionary containing \"train\", \"valid\", and \"test\" keys that correspond\n",
    "        to the DynamicItemDataset objects.\n",
    "    \"\"\"\n",
    "    # Define audio pipeline. In this case, we simply read the path contained\n",
    "    # in the variable wav with the audio reader.\n",
    "    @sb.utils.data_pipeline.takes(\"wav\")\n",
    "    @sb.utils.data_pipeline.provides(\"sig\")\n",
    "    def audio_pipeline(wav):\n",
    "        \"\"\"Load the audio signal. This is done on the CPU in the `collate_fn`.\"\"\"\n",
    "        sig = sb.dataio.dataio.read_audio(wav)\n",
    "        return sig\n",
    "\n",
    "    # Define the text processing pipeline. We start from the raw text and\n",
    "    # encode it using the tokenizer. The tokens with BOS are used for feeding\n",
    "    # the decoder during training, the tokens with EOS for computing the cost\n",
    "    # function. The tokens without BOS or EOS are used for computing the CTC loss.\n",
    "    @sb.utils.data_pipeline.takes(\"words\")\n",
    "    @sb.utils.data_pipeline.provides(\n",
    "        \"words\", \"tokens_list\", \"tokens_bos\", \"tokens_eos\", \"tokens\"\n",
    "    )\n",
    "    def text_pipeline(words):\n",
    "        \"\"\"Processes the transcriptions to generate proper labels\"\"\"\n",
    "        yield words\n",
    "        tokens_list = hparams[\"tokenizer\"].encode_as_ids(words)\n",
    "        yield tokens_list\n",
    "        tokens_bos = torch.LongTensor([hparams[\"bos_index\"]] + (tokens_list))\n",
    "        yield tokens_bos\n",
    "        tokens_eos = torch.LongTensor(tokens_list + [hparams[\"eos_index\"]])\n",
    "        yield tokens_eos\n",
    "        tokens = torch.LongTensor(tokens_list)\n",
    "        yield tokens\n",
    "\n",
    "    # Define datasets from json data manifest file\n",
    "    # Define datasets sorted by ascending lengths for efficiency\n",
    "    datasets = {}\n",
    "    data_folder = hparams[\"data_folder\"]\n",
    "    for dataset in [\"train\", \"valid\", \"test\"]:\n",
    "        datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(\n",
    "            json_path=hparams[f\"{dataset}_annotation\"],\n",
    "            replacements={\"data_root\": data_folder},\n",
    "            dynamic_items=[audio_pipeline, text_pipeline],\n",
    "            output_keys=[\n",
    "                \"id\",\n",
    "                \"sig\",\n",
    "                \"words\",\n",
    "                \"tokens_bos\",\n",
    "                \"tokens_eos\",\n",
    "                \"tokens\",\n",
    "            ],\n",
    "        )\n",
    "        hparams[f\"{dataset}_dataloader_opts\"][\"shuffle\"] = False\n",
    "\n",
    "    # Sorting the training data in ascending order makes the code much\n",
    "    # faster because we minimize zero-padding. In most cases, this\n",
    "    # does not harm performance.\n",
    "    if hparams[\"sorting\"] == \"ascending\":\n",
    "        datasets[\"train\"] = datasets[\"train\"].filtered_sorted(sort_key=\"length\")\n",
    "        hparams[\"train_dataloader_opts\"][\"shuffle\"] = False\n",
    "\n",
    "    elif hparams[\"sorting\"] == \"descending\":\n",
    "        datasets[\"train\"] = datasets[\"train\"].filtered_sorted(\n",
    "            sort_key=\"length\", reverse=True\n",
    "        )\n",
    "        hparams[\"train_dataloader_opts\"][\"shuffle\"] = False\n",
    "\n",
    "    elif hparams[\"sorting\"] == \"random\":\n",
    "        hparams[\"train_dataloader_opts\"][\"shuffle\"] = True\n",
    "\n",
    "    else:\n",
    "        raise NotImplementedError(\n",
    "            \"sorting must be random, ascending or descending\"\n",
    "        )\n",
    "    return datasets\n",
    "```\n",
    "\n",
    "Within `dataio_prepare` we define subfunctions for processing the entries defined in the JSON files.\n",
    "The first function, called `audio_pipeline`, takes the path of the audio signal (`wav`) and reads it. It returns a tensor containing the speech signal. The entry given in input to this function (i.e., `wav`) must have the same name as the corresponding key in the data manifest file:\n",
    "\n",
    "```json\n",
    "  \"1867-154075-0032\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac\",\n",
    "    \"length\": 16.09,\n",
    "    \"words\": \"AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE\"\n",
    "  },\n",
    "```\n",
    "\n",
    "Similarly, we define another function called `text_pipeline` to process the signal transcriptions and put them into a format usable by the model. The function reads the string `words` defined in the JSON file and tokenizes it (outputting the index of each token). It returns the sequence of tokens with the special beginning-of-sentence `<bos>` token prepended, as well as the version with the end-of-sentence `<eos>` token appended. We will see later why these additional elements are needed.\n",
    "\n",
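    "The construction of these label variants can be sketched with plain Python lists (the token IDs and special indexes below are made up for illustration; in the recipe they come from the tokenizer and the YAML file):\n",
    "\n",
    "```python\n",
    "# Hypothetical special indexes (in the recipe they come from the YAML file).\n",
    "bos_index, eos_index = 1, 2\n",
    "\n",
    "tokens_list = [57, 13, 204]             # e.g., tokenizer.encode_as_ids(words)\n",
    "tokens_bos = [bos_index] + tokens_list  # fed to the decoder (teacher forcing)\n",
    "tokens_eos = tokens_list + [eos_index]  # target of the seq2seq NLL loss\n",
    "tokens = list(tokens_list)              # target of the CTC loss\n",
    "\n",
    "print(tokens_bos)  # [1, 57, 13, 204]\n",
    "print(tokens_eos)  # [57, 13, 204, 2]\n",
    "```\n",
    "\n",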
    "We then create the `DynamicItemDataset` and connect it with the processing functions defined above. We define the desired output keys. These keys will be available in the brain class within the batch variable as:\n",
    "- batch.id\n",
    "- batch.sig\n",
    "- batch.words\n",
    "- batch.tokens_bos\n",
    "- batch.tokens_eos\n",
    "- batch.tokens\n",
    "\n",
    "The last part of the `dataio_prepare` function manages data sorting. In this case, we sort the data in ascending order to minimize zero-padding and speed up training. For more information on the dataloaders, [please see this tutorial](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing)\n",
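    "\n",
    "The effect of length-sorting can be illustrated with a toy example (the lengths below are made up): each batch is padded to its longest element, so grouping similar lengths together wastes fewer frames.\n",
    "\n",
    "```python\n",
    "def padded_frames(lengths, batch_size):\n",
    "    \"\"\"Total number of frames after padding each batch to its longest item.\"\"\"\n",
    "    total = 0\n",
    "    for i in range(0, len(lengths), batch_size):\n",
    "        batch = lengths[i:i + batch_size]\n",
    "        total += max(batch) * len(batch)\n",
    "    return total\n",
    "\n",
    "lengths = [12, 3, 9, 2, 11, 4]  # utterance lengths in frames (made up)\n",
    "print(padded_frames(lengths, batch_size=2))          # 64 frames\n",
    "print(padded_frames(sorted(lengths), batch_size=2))  # 48 frames\n",
    "```\n",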
    "\n",
    "\n",
    "After the definition of the dataio function, we perform pre-training of the language model, ASR model, and tokenizer:\n",
    "\n",
    "\n",
    "```python\n",
    "    run_on_main(hparams[\"pretrainer\"].collect_files)\n",
    "    hparams[\"pretrainer\"].load_collected(device=run_opts[\"device\"])\n",
    "```\n",
    "We here use the `run_on_main` wrapper because the `collect_files` method might need to download the pre-trained models from the web. This operation should be done by a single process only, even when using multiple GPUs with DDP.\n",
    "\n",
    "At this point we initialize the Brain class and use it for running training and evaluation:\n",
    "\n",
    "\n",
    "```python\n",
    "\n",
    "    # Trainer initialization\n",
    "    asr_brain = ASR(\n",
    "        modules=hparams[\"modules\"],\n",
    "        opt_class=hparams[\"opt_class\"],\n",
    "        hparams=hparams,\n",
    "        run_opts=run_opts,\n",
    "        checkpointer=hparams[\"checkpointer\"],\n",
    "    )\n",
    "\n",
    "    # Training\n",
    "    asr_brain.fit(\n",
    "        asr_brain.hparams.epoch_counter,\n",
    "        datasets[\"train\"],\n",
    "        datasets[\"valid\"],\n",
    "        train_loader_kwargs=hparams[\"train_dataloader_opts\"],\n",
    "        valid_loader_kwargs=hparams[\"valid_dataloader_opts\"],\n",
    "    )\n",
    "\n",
    "    # Load best checkpoint for evaluation\n",
    "    test_stats = asr_brain.evaluate(\n",
    "        test_set=datasets[\"test\"],\n",
    "        min_key=\"WER\",\n",
    "        test_loader_kwargs=hparams[\"test_dataloader_opts\"],\n",
    "    )\n",
    "```\n",
    "\n",
    "For more information on how the Brain class works, [please see this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/brain-class.html)\n",
    "Note that the `fit` and `evaluate` methods take the dataset objects as input as well. From these datasets, a PyTorch dataloader is created automatically; the latter creates the batches used for training and evaluation.\n",
    "\n",
    "When speech sentences with **different lengths** are sampled, zero-padding is performed. To keep track of the real length of each sentence within a batch, the dataloader returns a special tensor containing the **relative lengths** as well. For instance, let's assume `batch.sig[0]` to be the variable containing the input waveforms as a zero-padded [batch, time] tensor, whose padding pattern (1 = real data, 0 = padding) looks like this:\n",
    "\n",
    "```\n",
    "tensor([[1, 1, 0, 0],\n",
    "        [1, 1, 1, 0],\n",
    "        [1, 1, 1, 1]])\n",
    "```\n",
    "The `batch.sig[1]` will contain the following relative lengths:\n",
    "\n",
    "```\n",
    "tensor([0.5000, 0.7500, 1.0000])\n",
    "```\n",
    "\n",
    "With this information, we can exclude the zero-padded steps from some computations (e.g., feature normalization, statistical pooling, the loss, etc.).\n",
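    "\n",
    "A pure-Python sketch of how such a mask can be derived from the relative lengths (the shapes are made up; SpeechBrain does this internally with torch tensors):\n",
    "\n",
    "```python\n",
    "# Turn relative lengths into a binary mask over the padded time dimension.\n",
    "rel_lens = [0.5, 0.75, 1.0]\n",
    "T = 4  # padded time dimension of the batch\n",
    "\n",
    "mask = [[1 if t < round(r * T) else 0 for t in range(T)] for r in rel_lens]\n",
    "print(mask)  # [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]\n",
    "```\n",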
    "\n",
    "### Why relative lengths instead of absolute lengths?\n",
    "\n",
    "The preference for relative lengths over absolute lengths stems from the dynamic nature of time resolution within a neural network. Several operations, including pooling, stride convolution, transposed convolution, FFT computation, and others, have the potential to alter the number of time steps in a sequence.\n",
    "\n",
    "By employing the relative position trick, the calculation of actual time steps at each stage of neural computations becomes more flexible. This is achieved by multiplying the relative length by the total length of the tensor. Consequently, the approach adapts to changes in time resolution introduced by various network operations, ensuring a more robust and adaptable representation of temporal information throughout the neural network's computations.\n",
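    "\n",
    "A small sketch of this trick (the time resolutions below are made up): the same relative lengths recover the valid number of steps both at the waveform level and after the encoder has changed the time resolution.\n",
    "\n",
    "```python\n",
    "rel_lens = [0.5, 0.75, 1.0]\n",
    "\n",
    "T_wave = 16000  # time steps at the waveform level\n",
    "T_enc = 500     # time steps after the encoder (e.g., due to striding)\n",
    "\n",
    "print([int(r * T_wave) for r in rel_lens])  # [8000, 12000, 16000]\n",
    "print([int(r * T_enc) for r in rel_lens])   # [250, 375, 500]\n",
    "```\n",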
    "\n",
    "\n",
    "#### Forward Computations\n",
    "In the Brain class we have to define some important methods such as:\n",
    "- `compute_forward`, which specifies all the computations needed to transform the input waveform into the output posterior probabilities.\n",
    "- `compute_objectives`, which computes the loss function given the labels and the predictions performed by the model.\n",
    "\n",
    "Let's take a look into `compute_forward` first:\n",
    "\n",
    "\n",
    "```python\n",
    "    def compute_forward(self, batch, stage):\n",
    "        \"\"\"Runs all the computation of the CTC + seq2seq ASR. It returns the\n",
    "        posterior probabilities of the CTC and seq2seq networks.\n",
    "\n",
    "        Arguments\n",
    "        ---------\n",
    "        batch : PaddedBatch\n",
    "            This batch object contains all the relevant tensors for computation.\n",
    "        stage : sb.Stage\n",
    "            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        predictions : dict\n",
    "            At training time it returns predicted seq2seq log probabilities.\n",
    "            If needed it also returns the ctc output log probabilities.\n",
    "            At validation/test time, it returns the predicted tokens as well.\n",
    "        \"\"\"\n",
    "        # We first move the batch to the appropriate device.\n",
    "        batch = batch.to(self.device)\n",
    "\n",
    "        feats, self.feat_lens = self.prepare_features(stage, batch.sig)\n",
    "        tokens_bos, _ = self.prepare_tokens(stage, batch.tokens_bos)\n",
    "\n",
    "        # Running the encoder (prevent propagation to feature extraction)\n",
    "        encoded_signal = self.modules.encoder(feats.detach())\n",
    "\n",
    "        # Embed tokens and pass tokens & encoded signal to decoder\n",
    "        embedded_tokens = self.modules.embedding(tokens_bos.detach())\n",
    "        decoder_outputs, _ = self.modules.decoder(\n",
    "            embedded_tokens, encoded_signal, self.feat_lens\n",
    "        )\n",
    "\n",
    "        # Output layer for seq2seq log-probabilities\n",
    "        logits = self.modules.seq_lin(decoder_outputs)\n",
    "        predictions = {\"seq_logprobs\": self.hparams.log_softmax(logits)}\n",
    "\n",
    "        if self.is_ctc_active(stage):\n",
    "            # Output layer for ctc log-probabilities\n",
    "            ctc_logits = self.modules.ctc_lin(encoded_signal)\n",
    "            predictions[\"ctc_logprobs\"] = self.hparams.log_softmax(ctc_logits)\n",
    "\n",
    "        elif stage != sb.Stage.TRAIN:\n",
    "            if stage == sb.Stage.VALID:\n",
    "                hyps, _, _, _ = self.hparams.valid_search(\n",
    "                    encoded_signal, self.feat_lens\n",
    "                )\n",
    "            elif stage == sb.Stage.TEST:\n",
    "                hyps, _, _, _ = self.hparams.test_search(\n",
    "                    encoded_signal, self.feat_lens\n",
    "                )\n",
    "\n",
    "            predictions[\"tokens\"] = hyps\n",
    "\n",
    "        return predictions\n",
    "```\n",
    "\n",
    "\n",
    "The function takes the batch variable and the current stage (which can be `sb.Stage.TRAIN`, `sb.Stage.VALID`, or `sb.Stage.TEST`). We then move the batch to the right device, compute the features, and encode them with our CRDNN encoder.\n",
    "For more information on feature computation, [take a look into this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-features.html), while for more details on the speech augmentation [take a look here](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-augmentation.html).\n",
    "After that, we feed the encoded states into an autoregressive attention-based decoder that predicts the output tokens.\n",
    "At validation and test stages, we apply beamsearch on top of the token predictions.\n",
    "Our system applies an additional CTC loss on top of the encoder. The CTC loss can be turned off after N epochs if desired.\n",
    "\n",
    "\n",
    "#### Compute Objectives\n",
    "\n",
    "Let's now take a look at the `compute_objectives` function:\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "\n",
    "    def compute_objectives(self, predictions, batch, stage):\n",
    "        \"\"\"Computes the loss given the predicted and targeted outputs. We here\n",
    "        do multi-task learning and the loss is a weighted sum of the ctc + seq2seq\n",
    "        costs.\n",
    "\n",
    "        Arguments\n",
    "        ---------\n",
    "        predictions : dict\n",
    "            The output dict from `compute_forward`.\n",
    "        batch : PaddedBatch\n",
    "            This batch object contains all the relevant tensors for computation.\n",
    "        stage : sb.Stage\n",
    "            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.\n",
    "\n",
    "        Returns\n",
    "        -------\n",
    "        loss : torch.Tensor\n",
    "            A one-element tensor used for backpropagating the gradient.\n",
    "        \"\"\"\n",
    "\n",
    "        # Compute sequence loss against targets with EOS\n",
    "        tokens_eos, tokens_eos_lens = self.prepare_tokens(\n",
    "            stage, batch.tokens_eos\n",
    "        )\n",
    "        loss = sb.nnet.losses.nll_loss(\n",
    "            log_probabilities=predictions[\"seq_logprobs\"],\n",
    "            targets=tokens_eos,\n",
    "            length=tokens_eos_lens,\n",
    "            label_smoothing=self.hparams.label_smoothing,\n",
    "        )\n",
    "\n",
    "        # Add ctc loss if necessary. The total cost is a weighted sum of\n",
    "        # ctc loss + seq2seq loss\n",
    "        if self.is_ctc_active(stage):\n",
    "            # Load tokens without EOS as CTC targets\n",
    "            tokens, tokens_lens = self.prepare_tokens(stage, batch.tokens)\n",
    "            loss_ctc = self.hparams.ctc_cost(\n",
    "                predictions[\"ctc_logprobs\"], tokens, self.feat_lens, tokens_lens\n",
    "            )\n",
    "            loss *= 1 - self.hparams.ctc_weight\n",
    "            loss += self.hparams.ctc_weight * loss_ctc\n",
    "\n",
    "        if stage != sb.Stage.TRAIN:\n",
    "            # Convert predicted tokens from indexes to words\n",
    "            predicted_words = [\n",
    "                self.hparams.tokenizer.decode_ids(prediction).split(\" \")\n",
    "                for prediction in predictions[\"tokens\"]\n",
    "            ]\n",
    "            target_words = [words.split(\" \") for words in batch.words]\n",
    "\n",
    "            # Monitor word error rate and character error rate at\n",
    "            # valid and test time.\n",
    "            self.wer_metric.append(batch.id, predicted_words, target_words)\n",
    "            self.cer_metric.append(batch.id, predicted_words, target_words)\n",
    "\n",
    "        return loss\n",
    "```\n",
    "\n",
    "Based on the predictions and the targets, we compute the negative log-likelihood (NLL) loss and, if needed, the Connectionist Temporal Classification (CTC) loss as well. The two losses are combined with a weight (`ctc_weight`). At validation and test stages, we also compute the word error rate (WER) and the character error rate (CER).\n",
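    "\n",
    "The weighted combination amounts to a simple convex sum (the loss values below are made up):\n",
    "\n",
    "```python\n",
    "ctc_weight = 0.5  # hypothetical value of the ctc_weight hyperparameter\n",
    "nll_loss = 2.0    # seq2seq negative log-likelihood\n",
    "ctc_loss = 3.0    # CTC loss\n",
    "\n",
    "loss = (1 - ctc_weight) * nll_loss + ctc_weight * ctc_loss\n",
    "print(loss)  # 2.5\n",
    "```\n",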
    "\n",
    "### Other Methods\n",
    "In addition to the primary methods `compute_forward` and `compute_objectives`, the code includes the `on_stage_start` and `on_stage_end` methods. The former initializes statistics objects, such as those tracking the Word Error Rate (WER) and Character Error Rate (CER). The latter oversees several critical aspects:\n",
    "\n",
    "- **Statistics Updates:** Manages the updating of statistics during training.\n",
    "- **Learning Rate Annealing:** Handles the adjustment of learning rates over epochs.\n",
    "- **Logging:** Facilitates logging of crucial information during the training process.\n",
    "- **Checkpointing:** Manages the creation and storage of checkpoints for resumable training.\n",
    "\n",
    "By incorporating these functions, the code ensures a comprehensive and efficient training pipeline for the speech recognition system.\n",
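    "\n",
    "A minimal, SpeechBrain-free sketch of the two hooks (the method names follow the Brain-class convention, but the metric trackers are replaced by plain lists here):\n",
    "\n",
    "```python\n",
    "class ToyBrain:\n",
    "    def on_stage_start(self, stage, epoch=None):\n",
    "        # Initialize per-stage statistics trackers (WER/CER in the real recipe).\n",
    "        if stage != \"train\":\n",
    "            self.wer_metric = []\n",
    "            self.cer_metric = []\n",
    "\n",
    "    def on_stage_end(self, stage, stage_loss, epoch=None):\n",
    "        # Summarize statistics, log, and (in the real recipe) anneal the\n",
    "        # learning rate and save a checkpoint.\n",
    "        if stage == \"train\":\n",
    "            self.train_loss = stage_loss\n",
    "        else:\n",
    "            print(f\"{stage} loss: {stage_loss:.2f}\")\n",
    "\n",
    "brain = ToyBrain()\n",
    "brain.on_stage_start(\"valid\")\n",
    "brain.on_stage_end(\"valid\", 1.2345)  # prints: valid loss: 1.23\n",
    "```\n",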
    "\n",
    "\n",
    "That's all. You can just run the code and train your speech recognizer.\n",
    "\n",
    "\n",
    "## Pretrain and Fine-tune\n",
    "\n",
    "In scenarios where training from scratch might not be the optimal choice, the option to begin with a pre-trained model and fine-tune it becomes valuable.\n",
    "\n",
    "It's crucial to note that for this approach to work seamlessly, the architecture of your model must precisely match that of the pre-trained model.\n",
    "\n",
    "One convenient way to implement this is by utilizing the `pretrainer` class in the YAML file. If you aim to pretrain the encoder of the speech recognizer, the following code snippet can be employed:\n",
    "\n",
    "```yaml\n",
    "pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer\n",
    " loadables:\n",
    "     encoder: !ref <encoder>\n",
    " paths:\n",
    "   encoder: !ref <encoder_ptfile>\n",
    "```\n",
    "\n",
    "Here, `!ref <encoder>` points to the encoder model defined earlier in the YAML file, while `encoder_ptfile` denotes the path where the pre-trained model is stored.\n",
    "\n",
    "To execute the pre-training process, ensure that you call the pre-trainer in the `train.py` file:\n",
    "\n",
    "```python\n",
    "run_on_main(hparams[\"pretrainer\"].collect_files)\n",
    "hparams[\"pretrainer\"].load_collected(device=run_opts[\"device\"])\n",
    "```\n",
    "\n",
    "It's essential to invoke this function before the `fit` method of the Brain class.\n",
    "\n",
    "For a more comprehensive understanding and practical examples, please refer to our [tutorial on pre-training and fine-tuning](https://speechbrain.readthedocs.io/en/latest/tutorials/advanced/pre-trained-models-and-fine-tuning-with-huggingface.html). This resource provides detailed insights into leveraging pre-trained models effectively in your speech recognition system.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4LnRq1_cpPXZ"
   },
   "source": [
    "## Step 5: Inference\n",
    "\n",
    "At this point, we can use the trained speech recognizer. For this type of ASR model, SpeechBrain provides some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/ASR.py)), such as `EncoderDecoderASR`, that make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in just four lines of code:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uvvY0dCbx5Sv"
   },
   "outputs": [],
   "source": [
    "from speechbrain.inference.ASR import EncoderDecoderASR\n",
    "\n",
    "asr_model = EncoderDecoderASR.from_hparams(source=\"speechbrain/asr-crdnn-rnnlm-librispeech\", savedir=\"/content/pretrained_model\")\n",
    "audio_file = 'speechbrain/asr-crdnn-rnnlm-librispeech/example.wav'\n",
    "asr_model.transcribe_file(audio_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2Dyv9x10gGzV"
   },
   "source": [
    "But, how does this work with your custom ASR system?\n",
    "\n",
    "### Utilizing Your Custom Speech Recognizer\n",
    "\n",
    "At this point, you have two options for training and deploying your speech recognizer on your data:\n",
    "\n",
    "\n",
    "1. **Utilizing Available Interfaces (e.g., `EncoderDecoderASR`):**\n",
    "    - Considered the most elegant and convenient option.\n",
    "    - Your model should adhere to certain constraints to fit the proposed interface seamlessly.\n",
    "    - This approach streamlines the integration of your custom ASR model with existing interfaces, enhancing adaptability and maintainability.\n",
    "\n",
    "2. **Building Your Own Custom Interface:**\n",
    "    - Craft an interface tailored precisely to your custom ASR model.\n",
    "    - Provides the flexibility to address unique requirements and specifications.\n",
    "    - Ideal for scenarios where existing interfaces do not fully meet your needs.\n",
    "\n",
    "**Note:** These solutions are not exclusive to ASR and can be extended to other tasks such as speaker recognition and source separation.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J6N0Fb51pFnZ"
   },
   "source": [
    "#### Using the `EncoderDecoderASR` interface\n",
    "\n",
    "The `EncoderDecoderASR` class allows you to decouple your trained model from the training recipe and to run inference (or encoding) on any new audio file in a few lines of code. The class has the following methods:\n",
    "\n",
    "- *encode_batch*: apply the encoder to an input batch and returns some encoded features.\n",
    "- *transcribe_file*: transcribes the single audio file in input.\n",
    "- *transcribe_batch*: transcribes the input batch.\n",
    "\n",
    "In fact, if you fulfill few constraints that we will detail in the next paragraph, you can simply do:\n",
    "\n",
    "```python\n",
    "from speechbrain.inference.ASR import EncoderDecoderASR\n",
    "\n",
    "asr_model = EncoderDecoderASR.from_hparams(source=\"your_local_folder\", hparams_file='your_file.yaml', savedir=\"pretrained_model\")\n",
    "audio_file = 'your_file.wav'\n",
    "asr_model.transcribe_file(audio_file)\n",
    "```\n",
    "\n",
    "Nevertheless, to allow such a generalization over all the possible EncoderDecoder ASR pipelines, you will have to consider a few constraints when deploying your system:\n",
    "\n",
    "1. **Necessary modules.** As you can see in the `EncoderDecoderASR` class, the modules defined in your yaml file MUST contain certain elements with specific names. In practice, you need a tokenizer, a decoder, and a decoder. The encoder can simply be a `speechbrain.nnet.containers.LengthsCapableSequential` composed with a sequence of features computation, normalization and model encoding.\n",
    "```python\n",
    "    HPARAMS_NEEDED = [\"tokenizer\"]\n",
    "    MODULES_NEEDED = [\n",
    "        \"encoder\",\n",
    "        \"decoder\",\n",
    "    ]\n",
    "```\n",
    "\n",
    "You also need to declare these entities in the YAML file and create the following dictionary called `modules`:\n",
    "\n",
    "```\n",
    "encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential\n",
    "    input_shape: [null, null, !ref <n_mels>]\n",
    "    compute_features: !ref <compute_features>\n",
    "    normalize: !ref <normalize>\n",
    "    model: !ref <enc>\n",
    "\n",
    "decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder\n",
    "    enc_dim: !ref <dnn_neurons>\n",
    "    input_size: !ref <emb_size>\n",
    "    rnn_type: gru\n",
    "    attn_type: location\n",
    "    hidden_size: !ref <dec_neurons>\n",
    "    attn_dim: 1024\n",
    "    num_layers: 1\n",
    "    scaling: 1.0\n",
    "    channels: 10\n",
    "    kernel_size: 100\n",
    "    re_init: True\n",
    "    dropout: !ref <dropout>\n",
    "\n",
    "\n",
    "modules:\n",
    "    encoder: !ref <encoder>\n",
    "    decoder: !ref <decoder>\n",
    "    lm_model: !ref <lm_model>\n",
    "```\n",
    "\n",
    "In this case, `enc` is a CRDNN, but could be any custom neural network for instance.\n",
    "\n",
    "  **Why do you need to ensure this?** Well, it simply is because these are the modules we call when inferring on the `EncoderDecoderASR` class. Here is an example of the `encode_batch()` function.\n",
    "```python\n",
    "[...]\n",
    "  wavs = wavs.float()\n",
    "  wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)\n",
    "  encoder_out = self.modules.encoder(wavs, wav_lens)\n",
    "return encoder_out\n",
    "```\n",
    "  **What if I have a complex asr_encoder structure with multiple deep neural networks and stuffs ?** Simply put everything in a torch.nn.ModuleList in your yaml:\n",
    "```yaml\n",
    "asr_encoder: !new:torch.nn.ModuleList\n",
    "    - [!ref <enc>, my_different_blocks ... ]\n",
    "```\n",
    "\n",
    "2. **Call to the pretrainer to load the checkpoints.** Finally, you need to define a call to the pretrainer that will load the different checkpoints of your trained model into the corresponding SpeechBrain modules. In short, it will load the weights of your encoder, language model or even simply load the tokenizer.\n",
    "```yaml\n",
    "pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer\n",
    "    loadables:\n",
    "        asr: !ref <asr_model>\n",
    "        lm: !ref <lm_model>\n",
    "        tokenizer: !ref <tokenizer>\n",
    "    paths:\n",
    "      asr: !ref <asr_model_ptfile>\n",
    "      lm: !ref <lm_model_ptfile>\n",
    "      tokenizer: !ref <tokenizer_ptfile>\n",
    "```\n",
    "The loadable field creates a link between a file (e.g. `lm` that is related to the checkpoint in `<lm_model_ptfile>`) to a yaml instance (e.g. `<lm_model>`) that is nothing more than your lm.\n",
    "\n",
    "If you respect these two constraints, it should works! Here, we give a complete example of a yaml that is used for inference only:\n",
    "\n",
    "```yaml\n",
    "\n",
    "# ############################################################################\n",
    "# Model: E2E ASR with attention-based ASR\n",
    "# Encoder: CRDNN model\n",
    "# Decoder: GRU + beamsearch + RNNLM\n",
    "# Tokens: BPE with unigram\n",
    "# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga 2020\n",
    "# ############################################################################\n",
    "\n",
    "\n",
    "# Feature parameters\n",
    "sample_rate: 16000\n",
    "n_fft: 400\n",
    "n_mels: 40\n",
    "\n",
    "# Model parameters\n",
    "activation: !name:torch.nn.LeakyReLU\n",
    "dropout: 0.15\n",
    "cnn_blocks: 2\n",
    "cnn_channels: (128, 256)\n",
    "inter_layer_pooling_size: (2, 2)\n",
    "cnn_kernelsize: (3, 3)\n",
    "time_pooling_size: 4\n",
    "rnn_class: !name:speechbrain.nnet.RNN.LSTM\n",
    "rnn_layers: 4\n",
    "rnn_neurons: 1024\n",
    "rnn_bidirectional: True\n",
    "dnn_blocks: 2\n",
    "dnn_neurons: 512\n",
    "emb_size: 128\n",
    "dec_neurons: 1024\n",
    "output_neurons: 1000  # index(blank/eos/bos) = 0\n",
    "blank_index: 0\n",
    "\n",
    "# Decoding parameters\n",
    "bos_index: 0\n",
    "eos_index: 0\n",
    "min_decode_ratio: 0.0\n",
    "max_decode_ratio: 1.0\n",
    "beam_size: 80\n",
    "eos_threshold: 1.5\n",
    "using_max_attn_shift: True\n",
    "max_attn_shift: 240\n",
    "lm_weight: 0.50\n",
    "coverage_penalty: 1.5\n",
    "temperature: 1.25\n",
    "temperature_lm: 1.25\n",
    "\n",
    "normalize: !new:speechbrain.processing.features.InputNormalization\n",
    "    norm_type: global\n",
    "\n",
    "compute_features: !new:speechbrain.lobes.features.Fbank\n",
    "    sample_rate: !ref <sample_rate>\n",
    "    n_fft: !ref <n_fft>\n",
    "    n_mels: !ref <n_mels>\n",
    "\n",
    "enc: !new:speechbrain.lobes.models.CRDNN.CRDNN\n",
    "    input_shape: [null, null, !ref <n_mels>]\n",
    "    activation: !ref <activation>\n",
    "    dropout: !ref <dropout>\n",
    "    cnn_blocks: !ref <cnn_blocks>\n",
    "    cnn_channels: !ref <cnn_channels>\n",
    "    cnn_kernelsize: !ref <cnn_kernelsize>\n",
    "    inter_layer_pooling_size: !ref <inter_layer_pooling_size>\n",
    "    time_pooling: True\n",
    "    using_2d_pooling: False\n",
    "    time_pooling_size: !ref <time_pooling_size>\n",
    "    rnn_class: !ref <rnn_class>\n",
    "    rnn_layers: !ref <rnn_layers>\n",
    "    rnn_neurons: !ref <rnn_neurons>\n",
    "    rnn_bidirectional: !ref <rnn_bidirectional>\n",
    "    rnn_re_init: True\n",
    "    dnn_blocks: !ref <dnn_blocks>\n",
    "    dnn_neurons: !ref <dnn_neurons>\n",
    "\n",
    "emb: !new:speechbrain.nnet.embedding.Embedding\n",
    "    num_embeddings: !ref <output_neurons>\n",
    "    embedding_dim: !ref <emb_size>\n",
    "\n",
    "dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder\n",
    "    enc_dim: !ref <dnn_neurons>\n",
    "    input_size: !ref <emb_size>\n",
    "    rnn_type: gru\n",
    "    attn_type: location\n",
    "    hidden_size: !ref <dec_neurons>\n",
    "    attn_dim: 1024\n",
    "    num_layers: 1\n",
    "    scaling: 1.0\n",
    "    channels: 10\n",
    "    kernel_size: 100\n",
    "    re_init: True\n",
    "    dropout: !ref <dropout>\n",
    "\n",
    "ctc_lin: !new:speechbrain.nnet.linear.Linear\n",
    "    input_size: !ref <dnn_neurons>\n",
    "    n_neurons: !ref <output_neurons>\n",
    "\n",
    "seq_lin: !new:speechbrain.nnet.linear.Linear\n",
    "    input_size: !ref <dec_neurons>\n",
    "    n_neurons: !ref <output_neurons>\n",
    "\n",
    "log_softmax: !new:speechbrain.nnet.activations.Softmax\n",
    "    apply_log: True\n",
    "\n",
    "lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM\n",
    "    output_neurons: !ref <output_neurons>\n",
    "    embedding_dim: !ref <emb_size>\n",
    "    activation: !name:torch.nn.LeakyReLU\n",
    "    dropout: 0.0\n",
    "    rnn_layers: 2\n",
    "    rnn_neurons: 2048\n",
    "    dnn_blocks: 1\n",
    "    dnn_neurons: 512\n",
    "    return_hidden: True  # For inference\n",
    "\n",
    "tokenizer: !new:sentencepiece.SentencePieceProcessor\n",
    "\n",
    "asr_model: !new:torch.nn.ModuleList\n",
    "    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]\n",
    "\n",
    "# We compose the inference (encoder) pipeline.\n",
    "encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential\n",
    "    input_shape: [null, null, !ref <n_mels>]\n",
    "    compute_features: !ref <compute_features>\n",
    "    normalize: !ref <normalize>\n",
    "    model: !ref <enc>\n",
    "\n",
    "ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer\n",
    "    eos_index: !ref <eos_index>\n",
    "    blank_index: !ref <blank_index>\n",
    "    ctc_fc: !ref <ctc_lin>\n",
    "\n",
    "coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer\n",
    "    vocab_size: !ref <output_neurons>\n",
    "\n",
    "rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer\n",
    "    language_model: !ref <lm_model>\n",
    "    temperature: !ref <temperature_lm>\n",
    "\n",
    "scorer: !new:speechbrain.decoders.scorer.ScorerBuilder\n",
    "    scorer_beam_scale: 1.5\n",
    "    full_scorers: [\n",
    "        !ref <rnnlm_scorer>,\n",
    "        !ref <coverage_scorer>]\n",
    "    partial_scorers: [!ref <ctc_scorer>]\n",
    "    weights:\n",
    "        rnnlm: !ref <lm_weight>\n",
    "        coverage: !ref <coverage_penalty>\n",
    "        ctc: !ref <ctc_weight_decode>\n",
    "\n",
    "decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher\n",
    "    embedding: !ref <emb>\n",
    "    decoder: !ref <dec>\n",
    "    linear: !ref <seq_lin>\n",
    "    bos_index: !ref <bos_index>\n",
    "    eos_index: !ref <eos_index>\n",
    "    min_decode_ratio: !ref <min_decode_ratio>\n",
    "    max_decode_ratio: !ref <max_decode_ratio>\n",
    "    beam_size: !ref <test_beam_size>\n",
    "    eos_threshold: !ref <eos_threshold>\n",
    "    using_max_attn_shift: !ref <using_max_attn_shift>\n",
    "    max_attn_shift: !ref <max_attn_shift>\n",
    "    temperature: !ref <temperature>\n",
    "    scorer: !ref <scorer>\n",
    "\n",
    "modules:\n",
    "    encoder: !ref <encoder>\n",
    "    decoder: !ref <decoder>\n",
    "    lm_model: !ref <lm_model>\n",
    "\n",
    "pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer\n",
    "    loadables:\n",
    "        asr: !ref <asr_model>\n",
    "        lm: !ref <lm_model>\n",
    "        tokenizer: !ref <tokenizer>\n",
    "\n",
    "\n",
    "```\n",
    "\n",
    "As you can see, it is a standard YAMl file, but with a pretrainer that loads the model. It is similar to the yaml file used for training. We only have to remove all the parts that are training-specific (e.g, training parameters, optimizers, checkpointers, etc.) and add the pretrainer and `encoder`, `decoder` elements that links the needed modules with their pre-trained files.\n",
    "\n",
    "#### Developing your own inference interface\n",
    "\n",
    "While the `EncoderDecoderASR` class has been designed to be as generic as possible, your might require a more complex inference scheme that better fits your needs.  In this case, you have to develop your own interface. To do so, follow these steps:\n",
    "\n",
    "1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py)):\n",
    "\n",
    "\n",
    "```python\n",
    "class MySuperTask(Pretrained):\n",
    "  # Here, do not hesitate to also add some required modules\n",
    "  # for further transparency.\n",
    "  HPARAMS_NEEDED = [\"mymodule1\", \"mymodule2\"]\n",
    "  MODULES_NEEDED = [\n",
    "        \"mytask_enc\",\n",
    "        \"my_searcher\",\n",
    "  ]\n",
    "  def __init__(self, *args, **kwargs):\n",
    "        super().__init__(*args, **kwargs)\n",
    "        # Do whatever is needed here w.r.t your system\n",
    "```\n",
    "\n",
    "This will enable your class to call useful functions such as `.from_hparams()` that fetches and loads based on a HyperPyYAML file, `load_audio()` that loads a given audio file.  Likely, most of the methods that we coded in the Pretrained class will fit your need. If not, you can override them to implement your custom functionality.\n",
    "\n",
    "\n",
    "2. Develop your interface and the different functionalities. Unfortunately, we can't provide a generic enough example here. You can add **any** function to this class that you think can make inference on your data/model easier and natural. For instance, we can create here a function that simply encodes a wav file using the `mytask_enc` module.\n",
    "```python\n",
    "class MySuperTask(Pretrained):\n",
    "  # Here, do not hesitate to also add some required modules\n",
    "  # for further transparency.\n",
    "  HPARAMS_NEEDED = [\"mymodule1\", \"mymodule2\"]\n",
    "  MODULES_NEEDED = [\n",
    "        \"mytask_enc\",\n",
    "        \"my_searcher\",\n",
    "  ]\n",
    "  def __init__(self, *args, **kwargs):\n",
    "        super().__init__(*args, **kwargs)\n",
    "        # Do whatever is needed here w.r.t your system\n",
    "  \n",
    "  def encode_file(self, path):\n",
    "        waveform = self.load_audio(path)\n",
    "        # Fake a batch:\n",
    "        batch = waveform.unsqueeze(0)\n",
    "        rel_length = torch.tensor([1.0])\n",
    "        with torch.no_grad():\n",
    "          rel_lens = rel_length.to(self.device)\n",
    "          encoder_out = self.encode_batch(waveform, rel_lens)\n",
    "        \n",
    "        return encode_file\n",
    "```\n",
    "\n",
    "Now, we can use your Interface in the following way:\n",
    "```python\n",
    "from speechbrain.pretrained import MySuperTask\n",
    "\n",
    "my_model = MySuperTask.from_hparams(source=\"your_local_folder\", hparams_file='your_file.yaml', savedir=\"pretrained_model\")\n",
    "audio_file = 'your_file.wav'\n",
    "encoded = my_model.encode_file(audio_file)\n",
    "\n",
    "```\n",
    "\n",
    "As you can see, this formalism is extremely flexible and enables you to create a holistic interface that can be used to do anything you want with your pretrained model.\n",
    "\n",
    "We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/ASR/seq2seq/train.py) if interested!\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z3pu0M42Pqju"
   },
   "source": [
    "## Customize your speech recognizer\n",
    "In a general case, you might have your own data and you would like to use your own model. Let's comment a bit more on how you can customize your recipe.\n",
    "\n",
    "**Suggestion**:  start from a recipe that is working (like the one used for this template) and only do the minimal modifications needed to customize it. Test your model step by step. Make sure your model can overfit on a tiny dataset composed of few sentences. If it doesn't overfit there is likely a bug in your model."
   ]
  },
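  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The overfitting sanity check above can be sketched in a few lines of plain PyTorch. This is a hypothetical toy setup (a linear model on random features), not part of the SpeechBrain recipe: the point is only that, on a handful of examples, the training loss should collapse toward zero.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "torch.manual_seed(0)\n",
    "\n",
    "# A tiny 'dataset': 4 examples with 10 features each (toy stand-in).\n",
    "feats = torch.randn(4, 10)\n",
    "targets = torch.randn(4, 2)\n",
    "\n",
    "model = torch.nn.Linear(10, 2)\n",
    "opt = torch.optim.SGD(model.parameters(), lr=0.05)\n",
    "\n",
    "losses = []\n",
    "for step in range(200):\n",
    "    opt.zero_grad()\n",
    "    loss = torch.nn.functional.mse_loss(model(feats), targets)\n",
    "    loss.backward()\n",
    "    opt.step()\n",
    "    losses.append(loss.item())\n",
    "\n",
    "# On such a tiny set, the final loss should be close to zero;\n",
    "# if it is not, suspect a bug in the model or in the training loop.\n",
    "print(losses[0], losses[-1])\n",
    "```\n",
    "\n",
    "The same check applies unchanged to a full ASR model: feed it the same few sentences over and over and watch the loss."
   ]
  },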
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tImuOg5XP3CY"
   },
   "source": [
    "### Train with your data\n",
    "All you have to do when changing the dataset is to update the data preparation script such that we create the JSON files formatted as expected. The `train.py` script expects that the JSON file to be like this:\n",
    "\n",
    "\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"1867-154075-0032\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac\",\n",
    "    \"length\": 16.09,\n",
    "    \"words\": \"AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE\"\n",
    "  },\n",
    "  \"1867-154075-0001\": {\n",
    "    \"wav\": \"{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac\",\n",
    "    \"length\": 14.9,\n",
    "    \"words\": \"THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION\"\n",
    "  },\n",
    "```\n",
    "\n",
    "You have to parse your dataset and create JSON files with a unique ID for each sentence, the path of the audio signal (wav), the length of the speech sentence in seconds (length), and the word transcriptions (\"words\"). That's all!\n",
    "\n"
   ]
  },
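  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parsing step above can be sketched with the standard library alone. The entries below are hypothetical placeholders (IDs, paths, durations, transcripts); in practice, you would extract them from your own corpus:\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Hypothetical parsed entries: (utterance_id, wav_path, length_sec, transcript).\n",
    "entries = [\n",
    "    ('utt_0001', '{data_root}/my_corpus/utt_0001.wav', 3.2, 'HELLO WORLD'),\n",
    "    ('utt_0002', '{data_root}/my_corpus/utt_0002.wav', 2.7, 'GOOD MORNING'),\n",
    "]\n",
    "\n",
    "# Build the manifest with one unique ID per sentence, as train.py expects.\n",
    "manifest = {\n",
    "    utt_id: {'wav': wav, 'length': length, 'words': words}\n",
    "    for utt_id, wav, length, words in entries\n",
    "}\n",
    "\n",
    "with open('train.json', 'w') as f:\n",
    "    json.dump(manifest, f, indent=2)\n",
    "```\n",
    "\n",
    "Repeat the same loop for your validation and test splits to obtain `valid.json` and `test.json`."
   ]
  },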
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IVCCe6cXPzJ0"
   },
   "source": [
    "### Train with your own model\n",
    "At some point, you might have your own model and you would like to plug it into the speech recognition pipeline.\n",
    "For instance, you might wanna replace our CRDNN encoder with something different. To do that, you have to create your own class and specify there the list of computations for your neural network. You can take a look into the models already existing in [speechbrain.lobes.models](https://github.com/speechbrain/speechbrain/tree/develop/speechbrain/lobes/models). If your model is a plain pipeline of computations, you can use the [sequential container](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/CRDNN.py#L14). If the model is a more complex chain of computations, you can create it as an instance of `torch.nn.Module` and define there the `__init__` and `forward` methods like [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/Xvector.py#L18).\n",
    "\n",
    "Once you defined your model, you only have to declare it in the yaml file and use it in `train.py`\n",
    "\n",
    "\n",
    "**Important:**  \n",
    "When plugging a new model, you have to tune again the most important hyperparameters of the system (e.g, learning rate, batch size, and the architectural parameters) to make the it working well.\n",
    "\n",
    "\n"
   ]
  },
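  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch, a hypothetical custom encoder written as a `torch.nn.Module` could look like this (the class name, sizes, and architecture are illustrative, not part of the recipe):\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "class MyCustomEncoder(torch.nn.Module):\n",
    "    # Illustrative replacement for the CRDNN encoder: a GRU over the\n",
    "    # feature frames followed by a linear projection.\n",
    "    def __init__(self, input_size=40, hidden_size=256, output_size=512):\n",
    "        super().__init__()\n",
    "        self.rnn = torch.nn.GRU(input_size, hidden_size, batch_first=True)\n",
    "        self.proj = torch.nn.Linear(hidden_size, output_size)\n",
    "\n",
    "    def forward(self, feats):\n",
    "        # feats: [batch, time, features]\n",
    "        rnn_out, _ = self.rnn(feats)\n",
    "        return self.proj(rnn_out)\n",
    "\n",
    "enc = MyCustomEncoder()\n",
    "out = enc(torch.randn(2, 50, 40))  # shape: [2, 50, 512]\n",
    "```\n",
    "\n",
    "In the YAML file, such a model would then be declared with `!new:` pointing to its import path and plugged in wherever `<enc>` is referenced."
   ]
  },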
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "W4pPJ0k3lJZj"
   },
   "source": [
    "\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "In this tutorial, we showed how to create an end-to-end speech recognizer from scratch using SpeechBrain. The proposed system contains all the basic ingredients to develop a state-of-the-art system (i.e., data augmentation, tokenization, language models, beamsearch, attention, etc)\n",
    "\n",
    "We described all the steps using a small dataset only. In a real case you have to train with much more data (see for instance our [LibriSpeech recipes](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "P-Trg_abjUTd"
   },
   "source": [
    "## Related Tutorials\n",
    "1. [YAML hyperpatameter specification](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/hyperpyyaml.html)\n",
    "2. [Brain Class](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/brain-class.html)\n",
    "3. [Checkpointing](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/checkpointing.html)\n",
    "4. [Data-io](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html)\n",
    "5. [Tokenizer](https://speechbrain.readthedocs.io/en/latest/tutorials/advanced/text-tokenizer.html)\n",
    "6. [Speech Features](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-features.html)\n",
    "7. [Speech Augmentation](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/speech-augmentation.html)\n",
    "8. [Environmental Corruption](https://speechbrain.readthedocs.io/en/latest/tutorials/preprocessing/environmental-corruption.html)\n",
    "9. [MultiGPU Training](https://speechbrain.readthedocs.io/en/latest/multigpu.html)\n",
    "10. [Pretrain and Fine-tune](https://speechbrain.readthedocs.io/en/latest/tutorials/advanced/pre-trained-models-and-fine-tuning-with-huggingface.html)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_footer",
    "tags": [
     "sb_auto_footer"
    ]
   },
   "source": [
    "## Citing SpeechBrain\n",
    "\n",
    "If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:\n",
    "\n",
    "```bibtex\n",
    "@misc{speechbrainV1,\n",
    "  title={Open-Source Conversational AI with {SpeechBrain} 1.0},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},\n",
    "  year={2024},\n",
    "  eprint={2407.00463},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={cs.LG},\n",
    "  url={https://arxiv.org/abs/2407.00463},\n",
    "}\n",
    "@misc{speechbrain,\n",
    "  title={{SpeechBrain}: A General-Purpose Speech Toolkit},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},\n",
    "  year={2021},\n",
    "  eprint={2106.04624},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={eess.AS},\n",
    "  note={arXiv:2106.04624}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
